In this analysis we predict the hourly count of bike rentals in Washington, D.C. using regression analysis, decision trees, Support Vector Machines (SVMs) and Neural Networks (NNs).
Before the project starts, all required packages are imported.
library(ggplot2)
library(e1071)
library(mgcv)
library(corrplot)
library(tree)
library(tidyverse)
library(boot)
library(MASS)
library(ipred)
library(randomForest)
library(gbm)
library(modelr)
library(neuralnet)
library(h2o)
library(bit64)
library(caret)
We downloaded the dataset “bikesharing.csv” from https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Bike%20Sharing.
In a first step the dataset is imported to R and stored in the data.frame d.bike:
d.bike <- read.csv("bikesharing.csv", header=TRUE)
The dataset contains hourly information about each day and its weather conditions in 17379 observations of 17 variables. The individual attributes are explained in the following:
In order to make an informed feature selection in the next step, we first investigate the correlations between the predictors. A corrplot offers a simple and clear display of the correlation matrix: correlations are represented by coloured dots, and the exact values can be read off the scale.
predictors <- c("dteday", "season", "yr", "mnth", "hr", "holiday", "weekday",
"workingday", "weathersit", "temp", "atemp", "hum", "windspeed",
"casual", "registered")
corrplot(cor(data.matrix(d.bike[predictors])))
As the corrplot shows, dteday correlates strongly with the predictor yr. It also correlates with season and mnth, and there is a relatively strong correlation between mnth and season. The strongest correlation, of almost 1, is between temp and atemp.
We decide to delete the features instant and dteday. instant is an artificially created record index and therefore has no relevant effect on cnt. dteday, the date of the record, correlates with the three attributes season, yr and mnth. The features season and mnth as well as temp and atemp are kept for now.
d.bike <- subset(d.bike, select=-c(instant, dteday))
head(d.bike)
## season yr mnth hr holiday weekday workingday weathersit temp atemp hum
## 1 1 0 1 0 0 6 0 1 0.24 0.2879 0.81
## 2 1 0 1 1 0 6 0 1 0.22 0.2727 0.80
## 3 1 0 1 2 0 6 0 1 0.22 0.2727 0.80
## 4 1 0 1 3 0 6 0 1 0.24 0.2879 0.75
## 5 1 0 1 4 0 6 0 1 0.24 0.2879 0.75
## 6 1 0 1 5 0 6 0 2 0.24 0.2576 0.75
## windspeed casual registered cnt
## 1 0.0000 3 13 16
## 2 0.0000 8 32 40
## 3 0.0000 5 27 32
## 4 0.0000 3 10 13
## 5 0.0000 0 1 1
## 6 0.0896 0 1 1
tail(d.bike)
## season yr mnth hr holiday weekday workingday weathersit temp atemp hum
## 17374 1 1 12 18 0 1 1 2 0.26 0.2727 0.48
## 17375 1 1 12 19 0 1 1 2 0.26 0.2576 0.60
## 17376 1 1 12 20 0 1 1 2 0.26 0.2576 0.60
## 17377 1 1 12 21 0 1 1 1 0.26 0.2576 0.60
## 17378 1 1 12 22 0 1 1 1 0.26 0.2727 0.56
## 17379 1 1 12 23 0 1 1 1 0.26 0.2727 0.65
## windspeed casual registered cnt
## 17374 0.1343 10 112 122
## 17375 0.1642 11 108 119
## 17376 0.1642 8 81 89
## 17377 0.1642 7 83 90
## 17378 0.1343 13 48 61
## 17379 0.1343 12 37 49
It looks like the data set was imported completely. To check whether there are any missing values (not available, NA) in this dataset, we count the number of NAs.
sum(is.na(d.bike))
## [1] 0
mean(is.na(d.bike))
## [1] 0
The data set contains no missing values.
At this point the data set d.bike is divided into training and test set. The training set should consist of 75% of the data and the test set of 25% of the data.
set.seed(1) # fix the random seed so the split is reproducible
d.bike.train.id <- sample(seq_len(nrow(d.bike)), size = floor(0.75*nrow(d.bike)))
d.bike.train <- d.bike[d.bike.train.id,]
d.bike.test <- d.bike[-d.bike.train.id,]
In this section each attribute is examined in order to form assumptions about its effect on cnt, using different statistical plots. The effects on cnt are tested using ANOVA for categorical attributes and a regression model for numerical ones. The aim of this explorative data analysis is to obtain a starting point for the model building.
To keep our documentation clear, not all graphical analyses are shown. For example, interactions between predictors are not shown in this document but only described. At these points, references are made to the R-file.
In the very first step we look at the distribution of the response variable cnt. For this purpose we create a histogram.
ggplot(data = d.bike, aes(x=cnt)) +
geom_histogram(bins=30, colour="black") + xlab("Count") + ylab("Frequency")
mean(d.bike$cnt)
## [1] 189.4631
sd(d.bike$cnt)
## [1] 181.3876
As we can see, the distribution is right-skewed and resembles a gamma distribution. In most hours - the observations are hourly bike rental counts - between 0 and 30 bikes (the leftmost bar) are rented.
Bootstrapping: to quantify the uncertainty of the mean of cnt, we recompute the mean on B resamples drawn with replacement; the 2.5% and 97.5% quantiles of these bootstrap means form a 95% percentile confidence interval.
mean(d.bike$cnt)
## [1] 189.4631
id <- sample(1:length(d.bike$cnt), replace = TRUE)
mean(d.bike$cnt[id])
## [1] 188.984
B <- 1000
t.mean <- c()
for(i in 1:B){
t.id <- sample(1:length(d.bike$cnt), replace = TRUE)
t.d.bike <- d.bike$cnt[t.id]
t.mean[i] <- mean(t.d.bike)
}
length(t.mean)
## [1] 1000
hist(t.mean, breaks = 50)
abline(v = mean(d.bike$cnt), col = "red")
sorted.means <- sort(t.mean)
quantile(sorted.means, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 186.8310 191.9523
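The loop above can be wrapped into a small reusable helper. This is a sketch in base R; boot.ci.mean is a name introduced here for illustration, and the real call would be boot.ci.mean(d.bike$cnt). The demonstration uses synthetic right-skewed data instead of the bike data.

```r
# Percentile bootstrap confidence interval for the mean of a numeric vector.
boot.ci.mean <- function(x, B = 1000, probs = c(0.025, 0.975)) {
  # resample B times with replacement and record each resample's mean
  boot.means <- replicate(B, mean(sample(x, replace = TRUE)))
  quantile(boot.means, probs = probs)
}

# Toy demonstration on right-skewed synthetic data (mean roughly 189,
# similar in scale to cnt):
set.seed(1)
x <- rgamma(1000, shape = 1.1, scale = 172)
boot.ci.mean(x)
```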
If we want to visualize the effect of the season as a categorical variable on the number of bike rentals it is a good idea to use a box plot.
ggplot(data = d.bike, aes(group=season, y = cnt, x = as.factor(season))) +
geom_boxplot() + xlab("Season") + ylab("Count")
As the box plot shows, season 1 (spring) is particularly different from season 2, 3 and 4 (summer, autumn and winter). The median is lowest in spring, even lower than in winter. In summer, autumn and winter, the medians differ only marginally. Since the medians differ partially, one can assume a slight effect. However, we want to examine this more closely.
With an Analysis of Variance (ANOVA) we check whether at least one level of season has a significant effect on the number of bike rentals. The first step is to create a null model that does not consider any effect of season. An F-test then compares the two models with each other.
lm.season.1 <- lm(cnt ~ as.factor(season), data = d.bike)
lm.season.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.season.0, lm.season.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(season)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17375 534032233 3 37729358 409.18 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see from the p-value, there is strong evidence that at least one level of season has an effect on the number of bike rentals. The lower residual sum of squares (RSS) of lm.season.1 also indicates that the more complex model has smaller residuals than the simpler model lm.season.0.
Since we also consider yr to be a categorical variable, a box plot is again a good choice. We will use box plots for the graphical representation of all categorical variables in the following.
ggplot(data = d.bike, aes(group=yr, y = cnt, x = as.factor(yr))) +
geom_boxplot() + xlab("Year") + ylab("Count")
At first glance we can see that the number of bike rentals has increased in year 1 (2012). The median is above the median of year 0 (2011).
lm.year.1 <- lm(cnt ~ as.factor(yr), data = d.bike)
lm.year.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.year.0, lm.year.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(yr)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17377 535884870 1 35876722 1163.4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value shows strong evidence that the year (2011 vs. 2012) has a significant effect on cnt. Furthermore, the RSS shows that the more complex model lm.year.1 has significantly smaller residuals than the simpler one.
ggplot(data = d.bike, aes(group=mnth, y = cnt, x = as.factor(mnth))) +
geom_boxplot() + scale_x_discrete(limits=seq(1,12)) + xlab("Month") + ylab("Count")
Here we see that bike rentals increased during the warmer months from May to September. Visually, the month seems to have an effect on cnt.
lm.month.1 <- lm(cnt ~ as.factor(mnth), data = d.bike)
lm.month.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.month.0, lm.month.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(mnth)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17367 528851615 11 42909976 128.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value shows, there is strong evidence that at least one month has a significant effect on cnt. The RSS is also unsurprising: the more complex model lm.month.1 seems to represent the data better than the model lm.month.0.
ggplot(data = d.bike, aes(group=hr, y = cnt, x = as.factor(hr))) +
geom_boxplot() + scale_x_discrete(limits=seq(0,23)) + xlab("Hour") + ylab("Count")
A similar pattern applies to the hour of the day. Bikes are rented most often in the morning between 7 and 8 am and in the evening between 5 and 6 pm, and the medians follow the same pattern. The individual hours seem to differ greatly from one another.
lm.hr.1 <- lm(cnt ~ as.factor(hr), data = d.bike)
lm.hr.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.hr.0, lm.hr.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(hr)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17355 285026910 23 286734681 759.09 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value reflects this: at least one hour of the day has a significant effect on cnt. The RSS differs extremely strongly between the complex and the simple model. The hour of the day will probably play a role in the final regression model.
ggplot(data = d.bike, aes(group=holiday, y = cnt, x = as.factor(holiday))) +
geom_boxplot() + xlab("Holiday") + ylab("Count")
According to the box plot, fewer bikes are rented on holidays. However, the difference to normal working days is small; the holiday median is close to the non-holiday median.
lm.holiday.1 <- lm(cnt ~ as.factor(holiday), data = d.bike)
lm.holiday.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.holiday.0, lm.holiday.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(holiday)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17377 571214702 1 546889 16.637 4.546e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value shows strong evidence that holiday has a significant effect on cnt. However, the RSS differs only minimally between the simple and the complex model, so holiday could be a feature that will not be included in the model.
ggplot(data = d.bike, aes(group=weekday, y= cnt, x = as.factor(weekday))) +
geom_boxplot() + xlab("Weekday") + ylab("Count")
Across all seven days of the week, bike rentals are almost constant, with slightly fewer bikes rented on Sunday and Monday. The median is lowest on Monday and highest on Saturday; on all other days the medians are similar.
lm.weekday.1 <- lm(cnt ~ as.factor(weekday), data = d.bike)
lm.weekday.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.weekday.0, lm.weekday.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(weekday)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17372 571073662 6 687929 3.4878 0.001899 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value shows, there is only moderate evidence that at least one weekday has a significant effect on cnt. The RSS also does not differ much between the simple and the complex model. This suggests that weekday will probably not be included in the final model.
ggplot(data = d.bike, aes(group=workingday, y = cnt, x = as.factor(workingday))) +
geom_boxplot() + xlab("Working day") + ylab("Count")
On working and non-working days, the counts of bike rentals are almost equal, and the two medians are nearly identical.
lm.workingday.1 <- lm(cnt ~ as.factor(workingday), data = d.bike)
lm.workingday.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.workingday.0, lm.workingday.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(workingday)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17377 571237204 1 524387 15.952 6.524e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value shows, there is strong evidence that workingday has a significant effect on cnt. Again, however, the RSS does not differ very much, so workingday possibly brings no benefit to the model and will probably be omitted.
ggplot(data = d.bike, aes(group=weathersit, y = cnt, x = as.factor(weathersit))) +
geom_boxplot() + xlab("Weather Situation") + ylab("Count")
If we look at the weather situation, rated on a scale from 1 (good weather) to 4 (bad weather), the box plot shows that bike rentals decrease steadily from 1 to 4, dropping faster from value 3 onwards. The medians follow the same pattern.
lm.weathersit.1 <- lm(cnt ~ as.factor(weathersit), data = d.bike)
lm.weathersit.0 <- lm(cnt ~ 1, data = d.bike)
anova(lm.weathersit.0, lm.weathersit.1)
## Analysis of Variance Table
##
## Model 1: cnt ~ 1
## Model 2: cnt ~ as.factor(weathersit)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17378 571761591
## 2 17375 559476561 3 12285030 127.17 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is strong evidence that at least one weather situation has a significant effect on cnt. The RSS differs greatly between the two models. This will probably be a feature that makes it into the final model.
ggplot(data = d.bike, mapping = aes(y = cnt, x = temp)) +
geom_point() + xlab("Normalized Temp") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The regression line confirms the positive correlation. As the smoothing spline shows, the relationship with temperature is more complex than linear, so a Generalised Additive Model (GAM) could be used if temp is considered in the final model. Higher temperatures appear to lead to a higher number of rented bikes.
gam.temp.1 <- gam(cnt ~ s(temp), data = d.bike)
summary(gam.temp.1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## cnt ~ s(temp)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.463 1.252 151.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(temp) 8.747 8.975 403.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.172 Deviance explained = 17.3%
## GCV = 27250 Scale est. = 27235 n = 17379
First, the relatively high estimated degrees of freedom (edf) of 8.747 is striking, indicating a highly complex relationship. The p-value shows strong evidence that the smooth effect is not 0 and thus that there is an association between temp and cnt. The adjusted R-squared shows that about 17 percent of the variability is explained by the model. Thus temp could play a role in the final model. temp shows interaction effects with all of the categorical variables except yr (see line 142, group13_bikesharing.R).
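The interaction checks referenced in the R-file are not shown in this document. A minimal sketch of one such check is a nested-model F-test; the d.bike calls in the comments show the assumed form, while the runnable part uses synthetic data in which the slope genuinely varies by group.

```r
# Assumed form of one interaction check on the real data:
#   lm.add <- lm(cnt ~ as.factor(season) + temp, data = d.bike)
#   lm.int <- lm(cnt ~ as.factor(season) * temp, data = d.bike)
#   anova(lm.add, lm.int)  # F-test on the interaction terms
# Toy demonstration with synthetic data where the slope varies by group:
set.seed(7)
d <- data.frame(g = factor(rep(1:4, each = 50)), x = runif(200))
d$y <- as.numeric(d$g) * d$x + rnorm(200, sd = 0.2)
m.add <- lm(y ~ g + x, data = d)  # additive model
m.int <- lm(y ~ g * x, data = d)  # model with group-specific slopes
anova(m.add, m.int)
```

A small p-value in the second row of the ANOVA table indicates that the interaction terms improve the fit.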
ggplot(data = d.bike, mapping = aes(y = cnt, x = atemp)) +
geom_point() + xlab("Normalized Feeling Temp") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Not surprisingly, atemp shows a similarly complex relationship, so again a GAM is useful to investigate the effect of atemp on cnt.
gam.atemp.1 <- gam(cnt ~ s(atemp), data = d.bike)
summary(gam.atemp.1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## cnt ~ s(atemp)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.463 1.253 151.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(atemp) 7.556 8.338 427.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.17 Deviance explained = 17.1%
## GCV = 27315 Scale est. = 27301 n = 17379
The edf again indicates a more complex relationship between atemp and cnt. The p-value shows strong evidence that the smooth effect is not 0, and the adjusted R-squared shows that the model explains 17% of the variability. Therefore, atemp could also play a role in the final model. atemp shows interaction effects with all of the categorical variables except yr (see line 160, group13_bikesharing.R).
ggplot(data = d.bike, mapping = aes(y = cnt, x = hum)) +
geom_point() + xlab("Humidity") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The smoothing spline shows a possible quadratic effect of hum on cnt. Once again, a GAM is the obvious choice.
gam.hum.1 <- gam(cnt ~ s(hum), data = d.bike)
summary(gam.hum.1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## cnt ~ s(hum)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.463 1.298 146 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(hum) 7.872 8.548 252.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.111 Deviance explained = 11.1%
## GCV = 29281 Scale est. = 29266 n = 17379
As the hypothesis test shows, there is strong evidence that hum has an effect on cnt. Furthermore, the adjusted R-squared shows that the model explains 11% of the variability. Thus, hum could also be relevant for the final model. hum shows interaction effects with all of the categorical variables (see line 178, group13_bikesharing.R).
ggplot(data = d.bike, mapping = aes(y = cnt, x = windspeed)) +
geom_point() + xlab("Windspeed") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The graph suggests that any effect of windspeed on cnt is roughly linear. Moreover, the smoothing spline is almost flat, indicating that windspeed has very little influence.
lm.windspeed.1 <- lm(cnt ~ windspeed, data = d.bike)
summary(lm.windspeed.1)
##
## Call:
## lm(formula = cnt ~ windspeed, data = d.bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -265.47 -146.00 -48.75 90.25 806.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.185 2.532 64.46 <2e-16 ***
## windspeed 138.233 11.198 12.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 180.6 on 17377 degrees of freedom
## Multiple R-squared: 0.008693, Adjusted R-squared: 0.008635
## F-statistic: 152.4 on 1 and 17377 DF, p-value: < 2.2e-16
The wind speed has a positive estimated effect on cnt. This is surprising, since visually the observations seem to decrease with increasing wind speed. As the R-squared shows, this regression model explains less than 1% of the variability, which indicates that the attribute can be omitted from the final model.
ggplot(data = d.bike, mapping = aes(y = cnt, x = casual)) +
geom_point() + xlab("Casual Users") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
gam.casual.1 <- gam(cnt ~ s(casual), data = d.bike)
summary(gam.casual.1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## cnt ~ s(casual)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.4631 0.8841 214.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(casual) 8.928 8.998 2748 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.587 Deviance explained = 58.7%
## GCV = 13591 Scale est. = 13583 n = 17379
ggplot(data = d.bike, mapping = aes(y = cnt, x = registered)) +
geom_point() + xlab("Registered Users") + ylab("Count") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
lm.registered.1 <- lm(cnt ~ registered, data = d.bike)
summary(lm.registered.1)
##
## Call:
## lm(formula = cnt ~ registered, data = d.bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.031 -16.264 -9.957 4.730 304.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.296458 0.459718 22.4 <2e-16 ***
## registered 1.165032 0.002131 546.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.51 on 17377 degrees of freedom
## Multiple R-squared: 0.9451, Adjusted R-squared: 0.9451
## F-statistic: 2.99e+05 on 1 and 17377 DF, p-value: < 2.2e-16
Both casual and registered show a strong positive effect, and their models explain a large share of the variability. This is no surprise: cnt is by definition the sum of casual and registered, so these two attributes are components of the response rather than genuine predictors. They are therefore not considered in the model.
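The identity cnt = casual + registered can be checked directly. On the full data the check would be `all(d.bike$casual + d.bike$registered == d.bike$cnt)`; the stand-in data frame below mirrors the first rows of d.bike shown above.

```r
# cnt should equal casual + registered for every observation
# (values taken from the first three rows of head(d.bike)).
toy <- data.frame(casual     = c(3, 8, 5),
                  registered = c(13, 32, 27),
                  cnt        = c(16, 40, 32))
all(toy$casual + toy$registered == toy$cnt)  # TRUE
```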
In this section, the insights gained from the exploratory data analysis and the consideration of the individual attributes should help to form a hypothesis about what the final model might look like. This hypothesis is the starting point for developing the model.
The explorative data analysis has shown that the attributes season, yr, mnth, hr, weathersit, temp, atemp and hum can probably be considered in the model. From this, the following model can be assumed:
cnt = season + yr + mnth + hr + weathersit + temp + atemp + hum + season:temp + season:atemp + season:hum + yr:hum + mnth:temp + mnth:atemp + mnth:hum + hr:temp + hr:atemp + hr:hum + weathersit:temp + weathersit:atemp + weathersit:hum
First, five models are defined: the full model with all predictors, the starting model with polynomial and interaction effects, the starting model without interactions, the starting model without polynomial effects, and the starting model with neither polynomial nor interaction effects.
full.model.1 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(holiday) + as.factor(weekday) + as.factor(workingday) + as.factor(weathersit) + temp + atemp + hum + windspeed + casual + registered
starting.model.1 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + poly(hum, degree = 8.7) + poly(temp, degree = 8) + poly(atemp, degree = 8.9) + season:temp + season:atemp + season:hum + yr:hum + mnth:temp + mnth:atemp + mnth:hum + hr:temp + hr:atemp + hr:hum + weathersit:temp + weathersit:atemp + weathersit:hum
starting.model.2 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + poly(hum, degree = 8.7) + poly(temp, degree = 8) + poly(atemp, degree = 8.9)
# note: poly() truncates non-integer degrees, so degree = 8.7 and 8.9 act as degree 8
starting.model.3 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + temp + atemp + hum + season:temp + season:atemp + season:hum + yr:hum + mnth:temp + mnth:atemp + mnth:hum + hr:temp + hr:atemp + hr:hum + weathersit:temp + weathersit:atemp + weathersit:hum
starting.model.4 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + temp + atemp + hum
To find out which of the four starting models fits the data best, their results are compared first.
lm.starting.model.1 <- lm(starting.model.1, data = d.bike)
lm.starting.model.2 <- lm(starting.model.2, data = d.bike)
lm.starting.model.3 <- lm(starting.model.3, data = d.bike)
lm.starting.model.4 <- lm(starting.model.4, data = d.bike)
summary(lm.starting.model.1)$adj.r.squared
## [1] 0.7467235
summary(lm.starting.model.2)$adj.r.squared
## [1] 0.6943448
summary(lm.starting.model.3)$adj.r.squared
## [1] 0.7414885
summary(lm.starting.model.4)$adj.r.squared
## [1] 0.6837583
If only the adjusted R-squared is considered, starting.model.1 with a value of about 0.747 explains the variability of the data best. Thus we take starting.model.1 as the starting point for the model development.
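Adjusted R-squared is an in-sample measure; a complementary out-of-sample check is the RMSE on the held-out test set. A minimal sketch follows: the `rmse` helper is a name introduced here, the real call would be `rmse(lm.starting.model.1, d.bike.test, "cnt")`, and the demonstration uses synthetic data.

```r
# Root mean squared error of a fitted model on new data.
rmse <- function(model, data, response) {
  sqrt(mean((data[[response]] - predict(model, newdata = data))^2))
}

# Toy demonstration with a 75/25-style split of synthetic data:
set.seed(42)
d <- data.frame(x = runif(200))
d$y <- 2 * d$x + rnorm(200, sd = 0.1)
train <- d[1:150, ]
test  <- d[151:200, ]
rmse(lm(y ~ x, data = train), test, "y")  # close to the noise sd of 0.1
```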
summary(lm.starting.model.1)
##
## Call:
## lm(formula = starting.model.1, data = d.bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -409.85 -50.61 -3.05 41.55 431.06
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.504e+02 2.732e+01 -5.505 3.74e-08 ***
## as.factor(season)2 4.552e+01 2.008e+01 2.267 0.023428 *
## as.factor(season)3 1.696e+02 4.735e+01 3.581 0.000343 ***
## as.factor(season)4 1.593e+02 2.250e+01 7.080 1.50e-12 ***
## as.factor(yr)1 1.729e+02 5.050e+00 34.232 < 2e-16 ***
## as.factor(mnth)2 1.749e+01 1.574e+01 1.111 0.266646
## as.factor(mnth)3 -2.784e+01 1.830e+01 -1.521 0.128260
## as.factor(mnth)4 -1.754e+00 3.004e+01 -0.058 0.953434
## as.factor(mnth)5 4.633e+01 3.951e+01 1.173 0.240950
## as.factor(mnth)6 5.654e+01 4.732e+01 1.195 0.232197
## as.factor(mnth)7 -1.583e+01 7.914e+01 -0.200 0.841505
## as.factor(mnth)8 -9.061e+01 6.983e+01 -1.298 0.194425
## as.factor(mnth)9 -3.648e+01 5.214e+01 -0.700 0.484183
## as.factor(mnth)10 -5.008e+01 3.398e+01 -1.474 0.140532
## as.factor(mnth)11 -1.219e+02 2.957e+01 -4.123 3.76e-05 ***
## as.factor(mnth)12 -8.407e+01 2.251e+01 -3.735 0.000188 ***
## as.factor(hr)1 -5.915e+00 2.355e+01 -0.251 0.801717
## as.factor(hr)2 -2.188e+01 2.367e+01 -0.924 0.355258
## as.factor(hr)3 -3.934e+01 2.450e+01 -1.606 0.108363
## as.factor(hr)4 -5.036e+01 2.494e+01 -2.019 0.043481 *
## as.factor(hr)5 -3.047e+01 2.457e+01 -1.240 0.214877
## as.factor(hr)6 -8.563e+00 2.427e+01 -0.353 0.724207
## as.factor(hr)7 5.986e+01 2.434e+01 2.459 0.013928 *
## as.factor(hr)8 2.063e+02 2.401e+01 8.593 < 2e-16 ***
## as.factor(hr)9 1.239e+02 2.338e+01 5.297 1.19e-07 ***
## as.factor(hr)10 8.564e+01 2.309e+01 3.708 0.000209 ***
## as.factor(hr)11 1.038e+02 2.287e+01 4.539 5.69e-06 ***
## as.factor(hr)12 1.458e+02 2.285e+01 6.379 1.83e-10 ***
## as.factor(hr)13 1.507e+02 2.295e+01 6.570 5.19e-11 ***
## as.factor(hr)14 1.421e+02 2.304e+01 6.170 6.98e-10 ***
## as.factor(hr)15 1.515e+02 2.304e+01 6.574 5.02e-11 ***
## as.factor(hr)16 1.511e+02 2.295e+01 6.584 4.72e-11 ***
## as.factor(hr)17 1.849e+02 2.281e+01 8.104 5.68e-16 ***
## as.factor(hr)18 1.380e+02 2.259e+01 6.107 1.03e-09 ***
## as.factor(hr)19 8.509e+01 2.265e+01 3.757 0.000173 ***
## as.factor(hr)20 4.316e+01 2.273e+01 1.898 0.057657 .
## as.factor(hr)21 2.795e+01 2.291e+01 1.220 0.222417
## as.factor(hr)22 1.631e+01 2.304e+01 0.708 0.479119
## as.factor(hr)23 1.517e+00 2.333e+01 0.065 0.948171
## as.factor(weathersit)2 -1.883e+01 8.513e+00 -2.212 0.027012 *
## as.factor(weathersit)3 4.452e+00 2.243e+01 0.199 0.842651
## as.factor(weathersit)4 -9.061e+01 2.980e+02 -0.304 0.761043
## poly(hum, degree = 8.7)1 2.763e+03 9.508e+02 2.906 0.003669 **
## poly(hum, degree = 8.7)2 -1.021e+03 1.348e+02 -7.569 3.94e-14 ***
## poly(hum, degree = 8.7)3 -1.429e+02 1.238e+02 -1.154 0.248464
## poly(hum, degree = 8.7)4 2.572e+02 1.103e+02 2.332 0.019728 *
## poly(hum, degree = 8.7)5 -3.879e+01 1.036e+02 -0.375 0.707961
## poly(hum, degree = 8.7)6 -6.089e+01 9.550e+01 -0.638 0.523777
## poly(hum, degree = 8.7)7 9.659e+01 9.376e+01 1.030 0.302964
## poly(hum, degree = 8.7)8 -3.870e+02 9.332e+01 -4.147 3.39e-05 ***
## poly(temp, degree = 8)1 1.181e+04 7.092e+03 1.665 0.095835 .
## poly(temp, degree = 8)2 -2.931e+03 9.130e+02 -3.211 0.001326 **
## poly(temp, degree = 8)3 -2.470e+03 5.619e+02 -4.395 1.11e-05 ***
## poly(temp, degree = 8)4 -8.040e+02 3.172e+02 -2.534 0.011270 *
## poly(temp, degree = 8)5 1.081e+02 2.761e+02 0.392 0.695426
## poly(temp, degree = 8)6 4.461e+02 2.137e+02 2.087 0.036884 *
## poly(temp, degree = 8)7 1.532e+02 1.409e+02 1.088 0.276681
## poly(temp, degree = 8)8 -5.470e+01 1.285e+02 -0.426 0.670404
## poly(atemp, degree = 8.9)1 -1.793e+04 6.725e+03 -2.667 0.007666 **
## poly(atemp, degree = 8.9)2 4.792e+02 8.660e+02 0.553 0.580003
## poly(atemp, degree = 8.9)3 3.371e+02 5.276e+02 0.639 0.522839
## poly(atemp, degree = 8.9)4 -5.636e+01 3.058e+02 -0.184 0.853746
## poly(atemp, degree = 8.9)5 -1.515e+02 2.802e+02 -0.541 0.588692
## poly(atemp, degree = 8.9)6 -1.293e+02 2.097e+02 -0.617 0.537502
## poly(atemp, degree = 8.9)7 1.892e+02 1.787e+02 1.059 0.289715
## poly(atemp, degree = 8.9)8 7.118e+01 1.529e+02 0.465 0.641603
## season1:temp -5.197e+02 2.002e+02 -2.595 0.009459 **
## season2:temp -3.614e+02 2.364e+02 -1.529 0.126299
## season3:temp 6.286e+00 2.027e+02 0.031 0.975266
## season4:temp NA NA NA NA
## season1:atemp 6.683e+02 2.149e+02 3.110 0.001876 **
## season2:atemp 5.484e+02 2.615e+02 2.097 0.035975 *
## season3:atemp -4.823e+01 2.327e+02 -0.207 0.835807
## season4:atemp NA NA NA NA
## season1:hum 4.992e+01 2.686e+01 1.859 0.063096 .
## season2:hum -5.991e+00 2.999e+01 -0.200 0.841651
## season3:hum 6.459e-01 2.966e+01 0.022 0.982628
## season4:hum NA NA NA NA
## hum:yr1 -1.431e+02 7.750e+00 -18.465 < 2e-16 ***
## temp:mnth2 -5.400e+01 1.523e+02 -0.355 0.722967
## temp:mnth3 2.341e+02 1.994e+02 1.174 0.240344
## temp:mnth4 -2.552e+02 3.177e+02 -0.803 0.421759
## temp:mnth5 -1.068e+02 3.318e+02 -0.322 0.747478
## temp:mnth6 -1.233e+02 3.134e+02 -0.393 0.694003
## temp:mnth7 -2.349e+02 3.488e+02 -0.674 0.500624
## temp:mnth8 -2.143e+02 3.024e+02 -0.709 0.478607
## temp:mnth9 -3.485e+02 3.116e+02 -1.118 0.263450
## temp:mnth10 -1.929e+02 2.911e+02 -0.663 0.507577
## temp:mnth11 -3.472e+02 2.703e+02 -1.285 0.198961
## temp:mnth12 -1.495e+02 2.021e+02 -0.740 0.459336
## atemp:mnth2 5.027e+01 1.554e+02 0.324 0.746235
## atemp:mnth3 -8.513e+01 2.038e+02 -0.418 0.676162
## atemp:mnth4 3.775e+02 3.326e+02 1.135 0.256345
## atemp:mnth5 1.891e+02 3.538e+02 0.534 0.593053
## atemp:mnth6 2.316e+02 3.315e+02 0.699 0.484778
## atemp:mnth7 4.883e+02 3.606e+02 1.354 0.175647
## atemp:mnth8 5.821e+02 3.172e+02 1.835 0.066522 .
## atemp:mnth9 7.407e+02 3.366e+02 2.201 0.027763 *
## atemp:mnth10 5.761e+02 3.106e+02 1.855 0.063657 .
## atemp:mnth11 6.915e+02 2.852e+02 2.425 0.015327 *
## atemp:mnth12 4.586e+02 2.092e+02 2.192 0.028369 *
## hum:mnth2 -1.004e+01 1.943e+01 -0.517 0.605280
## hum:mnth3 6.240e+00 2.028e+01 0.308 0.758322
## hum:mnth4 -3.714e+01 2.818e+01 -1.318 0.187555
## hum:mnth5 -5.447e+01 3.013e+01 -1.808 0.070613 .
## hum:mnth6 -9.128e+01 3.084e+01 -2.960 0.003077 **
## hum:mnth7 -1.021e+02 4.481e+01 -2.277 0.022771 *
## hum:mnth8 -1.128e+02 4.084e+01 -2.761 0.005770 **
## hum:mnth9 -1.734e+02 3.597e+01 -4.822 1.43e-06 ***
## hum:mnth10 -1.339e+02 3.426e+01 -3.910 9.28e-05 ***
## hum:mnth11 -2.618e+01 3.363e+01 -0.779 0.436265
## hum:mnth12 -3.117e+01 2.752e+01 -1.133 0.257308
## temp:hr1 8.673e-01 1.758e+02 0.005 0.996063
## temp:hr2 -9.537e+01 1.743e+02 -0.547 0.584362
## temp:hr3 -1.269e+02 1.779e+02 -0.713 0.475714
## temp:hr4 -7.483e+01 1.804e+02 -0.415 0.678289
## temp:hr5 -8.529e+01 1.796e+02 -0.475 0.634960
## temp:hr6 -5.963e+01 1.770e+02 -0.337 0.736253
## temp:hr7 -1.690e+02 1.775e+02 -0.952 0.341036
## temp:hr8 -6.394e+01 1.771e+02 -0.361 0.718106
## temp:hr9 -7.412e+01 1.726e+02 -0.429 0.667673
## temp:hr10 -1.457e+02 1.716e+02 -0.849 0.395837
## temp:hr11 -1.314e+02 1.687e+02 -0.779 0.435943
## temp:hr12 -1.068e+02 1.645e+02 -0.649 0.516091
## temp:hr13 -1.331e+02 1.630e+02 -0.817 0.413986
## temp:hr14 -1.027e+02 1.625e+02 -0.632 0.527328
## temp:hr15 6.105e+01 1.620e+02 0.377 0.706356
## temp:hr16 6.380e+01 1.658e+02 0.385 0.700395
## temp:hr17 1.520e+02 1.640e+02 0.927 0.354028
## temp:hr18 8.601e+01 1.684e+02 0.511 0.609572
## temp:hr19 1.728e+02 1.708e+02 1.012 0.311600
## temp:hr20 1.763e+02 1.731e+02 1.018 0.308536
## temp:hr21 8.595e+01 1.751e+02 0.491 0.623498
## temp:hr22 6.471e+01 1.803e+02 0.359 0.719762
## temp:hr23 9.546e+00 1.828e+02 0.052 0.958359
## atemp:hr1 -5.446e+01 1.956e+02 -0.278 0.780730
## atemp:hr2 3.448e+01 1.948e+02 0.177 0.859520
## atemp:hr3 5.586e+01 1.993e+02 0.280 0.779228
## atemp:hr4 -7.392e+00 2.020e+02 -0.037 0.970808
## atemp:hr5 2.561e+01 2.014e+02 0.127 0.898798
## atemp:hr6 1.237e+02 1.983e+02 0.623 0.532972
## atemp:hr7 4.808e+02 1.983e+02 2.425 0.015330 *
## atemp:hr8 4.041e+02 1.967e+02 2.055 0.039926 *
## atemp:hr9 2.373e+02 1.907e+02 1.244 0.213467
## atemp:hr10 3.401e+02 1.897e+02 1.793 0.073065 .
## atemp:hr11 3.819e+02 1.868e+02 2.045 0.040899 *
## atemp:hr12 4.099e+02 1.825e+02 2.247 0.024676 *
## atemp:hr13 4.393e+02 1.813e+02 2.423 0.015388 *
## atemp:hr14 3.955e+02 1.808e+02 2.187 0.028723 *
## atemp:hr15 2.138e+02 1.804e+02 1.185 0.235978
## atemp:hr16 3.478e+02 1.848e+02 1.882 0.059841 .
## atemp:hr17 5.423e+02 1.830e+02 2.963 0.003046 **
## atemp:hr18 6.125e+02 1.872e+02 3.272 0.001070 **
## atemp:hr19 3.677e+02 1.898e+02 1.937 0.052754 .
## atemp:hr20 2.202e+02 1.919e+02 1.147 0.251200
## atemp:hr21 1.984e+02 1.939e+02 1.023 0.306220
## atemp:hr22 1.238e+02 1.998e+02 0.620 0.535565
## atemp:hr23 7.664e+01 2.024e+02 0.379 0.705004
## hum:hr1 1.399e+01 2.996e+01 0.467 0.640483
## hum:hr2 2.774e+01 3.032e+01 0.915 0.360159
## hum:hr3 4.193e+01 3.143e+01 1.334 0.182186
## hum:hr4 5.384e+01 3.177e+01 1.695 0.090179 .
## hum:hr5 3.279e+01 3.162e+01 1.037 0.299715
## hum:hr6 1.058e+01 3.109e+01 0.340 0.733522
## hum:hr7 -3.870e+01 3.059e+01 -1.265 0.205856
## hum:hr8 -6.984e+01 3.027e+01 -2.307 0.021044 *
## hum:hr9 -4.772e+01 2.956e+01 -1.614 0.106495
## hum:hr10 -8.687e+01 2.934e+01 -2.961 0.003071 **
## hum:hr11 -1.163e+02 2.919e+01 -3.986 6.76e-05 ***
## hum:hr12 -1.659e+02 2.915e+01 -5.691 1.28e-08 ***
## hum:hr13 -1.858e+02 2.915e+01 -6.374 1.89e-10 ***
## hum:hr14 -1.861e+02 2.895e+01 -6.429 1.32e-10 ***
## hum:hr15 -1.756e+02 2.875e+01 -6.108 1.03e-09 ***
## hum:hr16 -1.984e+02 2.858e+01 -6.943 3.97e-12 ***
## hum:hr17 -2.557e+02 2.823e+01 -9.061 < 2e-16 ***
## hum:hr18 -2.174e+02 2.830e+01 -7.683 1.64e-14 ***
## hum:hr19 -1.691e+02 2.821e+01 -5.995 2.08e-09 ***
## hum:hr20 -1.085e+02 2.834e+01 -3.830 0.000129 ***
## hum:hr21 -7.307e+01 2.867e+01 -2.548 0.010832 *
## hum:hr22 -4.286e+01 2.882e+01 -1.487 0.137069
## hum:hr23 -9.168e+00 2.924e+01 -0.314 0.753864
## temp:weathersit2 2.035e+00 6.183e+01 0.033 0.973736
## temp:weathersit3 2.983e+01 1.033e+02 0.289 0.772719
## temp:weathersit4 -4.949e+02 2.870e+03 -0.172 0.863102
## atemp:weathersit2 -2.220e+01 6.882e+01 -0.323 0.746973
## atemp:weathersit3 -8.854e+01 1.149e+02 -0.770 0.441079
## atemp:weathersit4 8.651e+02 2.579e+03 0.335 0.737282
## hum:weathersit2 3.037e+01 1.055e+01 2.879 0.003989 **
## hum:weathersit3 -3.694e+01 2.308e+01 -1.600 0.109568
## hum:weathersit4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.29 on 17193 degrees of freedom
## Multiple R-squared: 0.7494, Adjusted R-squared: 0.7467
## F-statistic: 277.9 on 185 and 17193 DF, p-value: < 2.2e-16
Considering the p-values of the individual predictors and interaction terms in the output of starting.model.1, it becomes clear that certain attributes have no significant effect on cnt. In a first iteration, the interaction terms weathersit:atemp, weathersit:temp, hr:atemp and mnth:temp are removed from the model. In a second iteration, the remaining interaction terms with only a weak or medium effect are removed: mnth:hum, season:temp and weathersit:hum. The iterations were carried out separately in the analysis (see lines 246-303, group13_bikesharing.R); for the sake of clarity, the terms are removed here in one pass.
## 1st iteration
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - weathersit:atemp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - weathersit:temp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - hr:atemp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - mnth:temp)
## 2nd iteration
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - mnth:hum)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - season:temp)
lm.starting.model.1 <- update(lm.starting.model.1, . ~ . - weathersit:hum)
summary(lm.starting.model.1)
##
## Call:
## lm(formula = cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) +
## as.factor(hr) + as.factor(weathersit) + poly(hum, degree = 8.7) +
## poly(temp, degree = 8) + poly(atemp, degree = 8.9) + season:atemp +
## season:hum + hum:yr + atemp:mnth + temp:hr + hum:hr, data = d.bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -418.83 -51.19 -2.54 41.54 431.19
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -143.267 25.714 -5.572 2.56e-08 ***
## as.factor(season)2 73.324 17.205 4.262 2.04e-05 ***
## as.factor(season)3 298.104 38.588 7.725 1.18e-14 ***
## as.factor(season)4 195.091 16.508 11.818 < 2e-16 ***
## as.factor(yr)1 172.268 4.990 34.521 < 2e-16 ***
## as.factor(mnth)2 10.876 10.721 1.014 0.310367
## as.factor(mnth)3 -20.106 13.838 -1.453 0.146264
## as.factor(mnth)4 -9.261 23.624 -0.392 0.695051
## as.factor(mnth)5 21.410 30.430 0.704 0.481711
## as.factor(mnth)6 7.077 38.071 0.186 0.852537
## as.factor(mnth)7 -40.325 50.737 -0.795 0.426750
## as.factor(mnth)8 -113.138 47.319 -2.391 0.016815 *
## as.factor(mnth)9 -182.423 40.325 -4.524 6.11e-06 ***
## as.factor(mnth)10 -136.993 23.215 -5.901 3.68e-09 ***
## as.factor(mnth)11 -124.801 20.972 -5.951 2.72e-09 ***
## as.factor(mnth)12 -99.478 15.568 -6.390 1.70e-10 ***
## as.factor(hr)1 -7.430 22.747 -0.327 0.743961
## as.factor(hr)2 -20.040 22.900 -0.875 0.381518
## as.factor(hr)3 -37.708 23.616 -1.597 0.110357
## as.factor(hr)4 -51.305 24.012 -2.137 0.032644 *
## as.factor(hr)5 -32.269 23.650 -1.364 0.172442
## as.factor(hr)6 -7.857 23.362 -0.336 0.736631
## as.factor(hr)7 73.630 23.393 3.148 0.001649 **
## as.factor(hr)8 215.553 23.377 9.221 < 2e-16 ***
## as.factor(hr)9 130.726 22.817 5.729 1.02e-08 ***
## as.factor(hr)10 94.937 22.567 4.207 2.60e-05 ***
## as.factor(hr)11 112.811 22.286 5.062 4.19e-07 ***
## as.factor(hr)12 154.479 22.217 6.953 3.70e-12 ***
## as.factor(hr)13 159.391 22.255 7.162 8.27e-13 ***
## as.factor(hr)14 149.476 22.303 6.702 2.12e-11 ***
## as.factor(hr)15 154.276 22.290 6.921 4.63e-12 ***
## as.factor(hr)16 157.121 22.198 7.078 1.52e-12 ***
## as.factor(hr)17 198.580 22.063 9.001 < 2e-16 ***
## as.factor(hr)18 153.956 21.929 7.021 2.29e-12 ***
## as.factor(hr)19 97.181 21.871 4.443 8.91e-06 ***
## as.factor(hr)20 50.504 21.879 2.308 0.020994 *
## as.factor(hr)21 33.505 22.027 1.521 0.128254
## as.factor(hr)22 19.421 22.067 0.880 0.378822
## as.factor(hr)23 5.565 22.301 0.250 0.802963
## as.factor(weathersit)2 -7.823 1.752 -4.464 8.10e-06 ***
## as.factor(weathersit)3 -54.424 3.076 -17.694 < 2e-16 ***
## as.factor(weathersit)4 -23.638 53.297 -0.444 0.657400
## poly(hum, degree = 8.7)1 957.555 586.675 1.632 0.102661
## poly(hum, degree = 8.7)2 -914.302 120.837 -7.566 4.03e-14 ***
## poly(hum, degree = 8.7)3 -177.031 98.917 -1.790 0.073521 .
## poly(hum, degree = 8.7)4 384.352 99.925 3.846 0.000120 ***
## poly(hum, degree = 8.7)5 -112.904 95.333 -1.184 0.236309
## poly(hum, degree = 8.7)6 -22.383 93.126 -0.240 0.810065
## poly(hum, degree = 8.7)7 64.497 92.906 0.694 0.487556
## poly(hum, degree = 8.7)8 -377.214 92.651 -4.071 4.69e-05 ***
## poly(temp, degree = 8)1 -2134.381 1133.604 -1.883 0.059741 .
## poly(temp, degree = 8)2 -3474.213 464.949 -7.472 8.26e-14 ***
## poly(temp, degree = 8)3 -2777.949 417.254 -6.658 2.87e-11 ***
## poly(temp, degree = 8)4 -457.491 270.972 -1.688 0.091365 .
## poly(temp, degree = 8)5 213.518 248.039 0.861 0.389347
## poly(temp, degree = 8)6 328.755 198.040 1.660 0.096925 .
## poly(temp, degree = 8)7 177.723 136.542 1.302 0.193071
## poly(temp, degree = 8)8 -55.473 124.113 -0.447 0.654912
## poly(atemp, degree = 8.9)1 -1740.835 1685.322 -1.033 0.301647
## poly(atemp, degree = 8.9)2 1224.997 461.324 2.655 0.007929 **
## poly(atemp, degree = 8.9)3 728.013 390.406 1.865 0.062233 .
## poly(atemp, degree = 8.9)4 -371.555 259.581 -1.431 0.152344
## poly(atemp, degree = 8.9)5 -294.540 246.098 -1.197 0.231384
## poly(atemp, degree = 8.9)6 8.645 195.419 0.044 0.964715
## poly(atemp, degree = 8.9)7 229.686 167.824 1.369 0.171138
## poly(atemp, degree = 8.9)8 59.002 147.736 0.399 0.689622
## season1:atemp 91.303 40.433 2.258 0.023949 *
## season2:atemp 137.184 37.128 3.695 0.000221 ***
## season3:atemp -136.185 66.492 -2.048 0.040563 *
## season4:atemp NA NA NA NA
## season1:hum 132.432 11.467 11.549 < 2e-16 ***
## season2:hum 30.183 11.509 2.623 0.008733 **
## season3:hum -56.517 14.683 -3.849 0.000119 ***
## season4:hum NA NA NA NA
## hum:yr1 -140.467 7.649 -18.363 < 2e-16 ***
## atemp:mnth2 -1.972 37.042 -0.053 0.957546
## atemp:mnth3 137.136 41.381 3.314 0.000922 ***
## atemp:mnth4 71.317 60.877 1.172 0.241412
## atemp:mnth5 46.588 69.212 0.673 0.500884
## atemp:mnth6 76.765 78.079 0.983 0.325535
## atemp:mnth7 172.625 92.254 1.871 0.061336 .
## atemp:mnth8 272.623 88.129 3.093 0.001981 **
## atemp:mnth9 411.659 80.779 5.096 3.50e-07 ***
## atemp:mnth10 357.439 61.564 5.806 6.51e-09 ***
## atemp:mnth11 288.204 60.364 4.774 1.82e-06 ***
## atemp:mnth12 285.572 50.310 5.676 1.40e-08 ***
## temp:hr0 -75.269 27.563 -2.731 0.006325 **
## temp:hr1 -123.094 27.696 -4.444 8.87e-06 ***
## temp:hr2 -139.435 28.057 -4.970 6.77e-07 ***
## temp:hr3 -153.613 28.643 -5.363 8.29e-08 ***
## temp:hr4 -157.870 28.775 -5.486 4.16e-08 ***
## temp:hr5 -138.378 28.590 -4.840 1.31e-06 ***
## temp:hr6 -27.234 28.181 -0.966 0.333858
## temp:hr7 179.174 27.123 6.606 4.06e-11 ***
## temp:hr8 220.824 26.350 8.380 < 2e-16 ***
## temp:hr9 65.043 26.066 2.495 0.012594 *
## temp:hr10 90.520 26.011 3.480 0.000502 ***
## temp:hr11 145.219 26.249 5.532 3.21e-08 ***
## temp:hr12 196.763 26.416 7.449 9.87e-14 ***
## temp:hr13 197.233 26.662 7.398 1.45e-13 ***
## temp:hr14 190.124 26.853 7.080 1.49e-12 ***
## temp:hr15 195.200 26.954 7.242 4.61e-13 ***
## temp:hr16 315.517 26.923 11.719 < 2e-16 ***
## temp:hr17 569.891 26.684 21.357 < 2e-16 ***
## temp:hr18 565.865 26.565 21.301 < 2e-16 ***
## temp:hr19 434.331 26.655 16.295 < 2e-16 ***
## temp:hr20 303.636 26.738 11.356 < 2e-16 ***
## temp:hr21 191.067 26.924 7.097 1.33e-12 ***
## temp:hr22 100.882 27.066 3.727 0.000194 ***
## temp:hr23 NA NA NA NA
## hum:hr1 13.718 30.007 0.457 0.647545
## hum:hr2 26.178 30.329 0.863 0.388074
## hum:hr3 42.385 31.456 1.347 0.177856
## hum:hr4 54.370 31.816 1.709 0.087483 .
## hum:hr5 35.891 31.645 1.134 0.256748
## hum:hr6 16.070 31.122 0.516 0.605617
## hum:hr7 -32.245 30.647 -1.052 0.292745
## hum:hr8 -62.430 30.252 -2.064 0.039064 *
## hum:hr9 -46.046 29.542 -1.559 0.119095
## hum:hr10 -87.280 29.232 -2.986 0.002833 **
## hum:hr11 -116.807 29.030 -4.024 5.75e-05 ***
## hum:hr12 -166.615 28.952 -5.755 8.82e-09 ***
## hum:hr13 -185.503 28.951 -6.408 1.52e-10 ***
## hum:hr14 -187.703 28.800 -6.518 7.35e-11 ***
## hum:hr15 -182.570 28.561 -6.392 1.68e-10 ***
## hum:hr16 -201.586 28.365 -7.107 1.23e-12 ***
## hum:hr17 -254.651 28.105 -9.061 < 2e-16 ***
## hum:hr18 -215.800 28.146 -7.667 1.85e-14 ***
## hum:hr19 -174.028 28.186 -6.174 6.80e-10 ***
## hum:hr20 -111.621 28.358 -3.936 8.31e-05 ***
## hum:hr21 -72.597 28.724 -2.527 0.011500 *
## hum:hr22 -41.303 28.889 -1.430 0.152820
## hum:hr23 -9.223 29.302 -0.315 0.752947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.56 on 17249 degrees of freedom
## Multiple R-squared: 0.7471, Adjusted R-squared: 0.7452
## F-statistic: 395 on 129 and 17249 DF, p-value: < 2.2e-16
All interactions with weak or medium effects have now been removed from the model. atemp also shows only a medium effect, but since it appears in several interactions with strong effects, it is retained. The adjusted R-squared was barely reduced by these adjustments (from 0.7467 to 0.7452), so the final model still explains about 74.5% of the variance in the data. The final model now looks as follows:
cnt = season + yr + mnth + hr + weathersit + temp + atemp + hum + season:atemp + season:hum + yr:hum + mnth:atemp + hr:temp + hr:hum
The predictors temp, atemp and hum show polynomial effects.
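Written out in R, the model above corresponds to a formula object. A sketch of how it might be defined (the name final.model.1 matches the later code; the summary output shows non-integer poly() degrees of 8.7 and 8.9 for hum and atemp, which poly() truncates to 8, so integer degrees are written here):

```r
## final regression model as an R formula object; polynomial degrees
## follow the summary output above (8.7 / 8.9 are truncated to 8 by poly())
final.model.1 <- cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) +
  as.factor(hr) + as.factor(weathersit) +
  poly(hum, degree = 8) + poly(temp, degree = 8) + poly(atemp, degree = 8) +
  season:atemp + season:hum + yr:hum + mnth:atemp + hr:temp + hr:hum
```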
In this section, the result of the section “Model development” is validated out of sample. For this purpose, all models that played a role in the development of the final model are compared with each other. The comparison value is the mean test R-squared of each model, computed over ten repeated random 75%/25% train/test splits.
lm.starting.model.1 <- lm(starting.model.1, data = d.bike)
lm.starting.model.2 <- lm(starting.model.2, data = d.bike)
lm.starting.model.3 <- lm(starting.model.3, data = d.bike)
lm.starting.model.4 <- lm(starting.model.4, data = d.bike)
lm.full.model.1 <- lm(full.model.1, data = d.bike)
lm.final.model.1 <- lm(final.model.1, data = d.bike)
## the test R-squared of each repetition is stored in a vector,
## so that mean() below really averages over the ten repetitions
r.squared.starting.model.1 <- numeric(10)
r.squared.starting.model.2 <- numeric(10)
r.squared.starting.model.3 <- numeric(10)
r.squared.starting.model.4 <- numeric(10)
r.squared.full.model <- numeric(10)
r.squared.final.model <- numeric(10)
for(i in 1:10){
d.bike.train.id <- sample(seq_len(nrow(d.bike)), size = floor(0.75*nrow(d.bike)))
d.bike.train <- d.bike[d.bike.train.id,]
d.bike.test <- d.bike[-d.bike.train.id,]
#predict data with starting model 1
lm.starting.model.1.train <- lm(starting.model.1, data = d.bike.train)
predicted.starting.model.1.test <- predict(lm.starting.model.1.train,
newdata = d.bike.test)
r.squared.starting.model.1[i] <- cor(predicted.starting.model.1.test, d.bike.test$cnt)^2
#predict data with starting model 2
lm.starting.model.2.train <- lm(starting.model.2, data = d.bike.train)
predicted.starting.model.2.test <- predict(lm.starting.model.2.train,
newdata = d.bike.test)
r.squared.starting.model.2[i] <- cor(predicted.starting.model.2.test, d.bike.test$cnt)^2
#predict data with starting model 3
lm.starting.model.3.train <- lm(starting.model.3, data = d.bike.train)
predicted.starting.model.3.test <- predict(lm.starting.model.3.train,
newdata = d.bike.test)
r.squared.starting.model.3[i] <- cor(predicted.starting.model.3.test, d.bike.test$cnt)^2
#predict data with starting model 4
lm.starting.model.4.train <- lm(starting.model.4, data = d.bike.train)
predicted.starting.model.4.test <- predict(lm.starting.model.4.train,
newdata = d.bike.test)
r.squared.starting.model.4[i] <- cor(predicted.starting.model.4.test, d.bike.test$cnt)^2
#predict data with full model
lm.full.model.train <- lm(full.model.1, data = d.bike.train)
predicted.full.model.test <- predict(lm.full.model.train,
newdata = d.bike.test)
r.squared.full.model[i] <- cor(predicted.full.model.test, d.bike.test$cnt)^2
#predict data with final model
lm.final.model.train <- lm(final.model.1, data = d.bike.train)
predicted.final.model.test <- predict(lm.final.model.train,
newdata = d.bike.test)
r.squared.final.model[i] <- cor(predicted.final.model.test, d.bike.test$cnt)^2
}
mean(r.squared.starting.model.1)
## [1] 0.7542875
mean(r.squared.starting.model.2)
## [1] 0.7072345
mean(r.squared.starting.model.3)
## [1] 0.7492328
mean(r.squared.starting.model.4)
## [1] 0.6984628
mean(r.squared.full.model)
## [1] 1
mean(r.squared.final.model)
## [1] 0.7535116
As the comparison shows, full.model.1 is the best model according to R-squared. Its test R-squared of exactly 1 is no surprise, however: the full model presumably still contains casual and registered, which sum to cnt, so the response can be reproduced perfectly. Since the final model is the second best and much less complex than the full model, the result of the model development is validated. For the further analysis steps, final.model.1 is used.
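The repeated-split loop above averages over random 75/25 splits. A genuine k-fold cross-validation, where every observation is held out exactly once, can be sketched in a few lines of base R (the helper name cv.r.squared is ours, not from the original script):

```r
## k-fold cross-validated test R-squared for a linear model formula
## (illustrative helper, not part of the original analysis)
cv.r.squared <- function(formula, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  r2 <- numeric(k)
  for (j in seq_len(k)) {
    fit  <- lm(formula, data = data[folds != j, ])        # train on k-1 folds
    pred <- predict(fit, newdata = data[folds == j, ])    # predict held-out fold
    obs  <- eval(formula[[2]], data[folds == j, ])        # observed response
    r2[j] <- cor(pred, obs)^2
  }
  mean(r2)                                                # average over the k folds
}
```

A call such as `cv.r.squared(final.model.1, d.bike)` would then give the 10-fold test R-squared directly.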
In this chapter the bike rental count cnt is predicted with a linear, a non-linear (GAM) and a Poisson regression model. The three models are then compared using a 10-fold cross-validation based on the R-squared.
lm.bike.1 <- lm(final.model.1, data = d.bike)
summary(lm.bike.1)
##
## Call:
## lm(formula = final.model.1, data = d.bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -418.83 -51.19 -2.54 41.54 431.19
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -143.267 25.714 -5.572 2.56e-08 ***
## as.factor(season)2 73.324 17.205 4.262 2.04e-05 ***
## as.factor(season)3 298.104 38.588 7.725 1.18e-14 ***
## as.factor(season)4 195.091 16.508 11.818 < 2e-16 ***
## as.factor(yr)1 172.268 4.990 34.521 < 2e-16 ***
## as.factor(mnth)2 10.876 10.721 1.014 0.310367
## as.factor(mnth)3 -20.106 13.838 -1.453 0.146264
## as.factor(mnth)4 -9.261 23.624 -0.392 0.695051
## as.factor(mnth)5 21.410 30.430 0.704 0.481711
## as.factor(mnth)6 7.077 38.071 0.186 0.852537
## as.factor(mnth)7 -40.325 50.737 -0.795 0.426750
## as.factor(mnth)8 -113.138 47.319 -2.391 0.016815 *
## as.factor(mnth)9 -182.423 40.325 -4.524 6.11e-06 ***
## as.factor(mnth)10 -136.993 23.215 -5.901 3.68e-09 ***
## as.factor(mnth)11 -124.801 20.972 -5.951 2.72e-09 ***
## as.factor(mnth)12 -99.478 15.568 -6.390 1.70e-10 ***
## as.factor(hr)1 -7.430 22.747 -0.327 0.743961
## as.factor(hr)2 -20.040 22.900 -0.875 0.381518
## as.factor(hr)3 -37.708 23.616 -1.597 0.110357
## as.factor(hr)4 -51.305 24.012 -2.137 0.032644 *
## as.factor(hr)5 -32.269 23.650 -1.364 0.172442
## as.factor(hr)6 -7.857 23.362 -0.336 0.736631
## as.factor(hr)7 73.630 23.393 3.148 0.001649 **
## as.factor(hr)8 215.553 23.377 9.221 < 2e-16 ***
## as.factor(hr)9 130.726 22.817 5.729 1.02e-08 ***
## as.factor(hr)10 94.937 22.567 4.207 2.60e-05 ***
## as.factor(hr)11 112.811 22.286 5.062 4.19e-07 ***
## as.factor(hr)12 154.479 22.217 6.953 3.70e-12 ***
## as.factor(hr)13 159.391 22.255 7.162 8.27e-13 ***
## as.factor(hr)14 149.476 22.303 6.702 2.12e-11 ***
## as.factor(hr)15 154.276 22.290 6.921 4.63e-12 ***
## as.factor(hr)16 157.121 22.198 7.078 1.52e-12 ***
## as.factor(hr)17 198.580 22.063 9.001 < 2e-16 ***
## as.factor(hr)18 153.956 21.929 7.021 2.29e-12 ***
## as.factor(hr)19 97.181 21.871 4.443 8.91e-06 ***
## as.factor(hr)20 50.504 21.879 2.308 0.020994 *
## as.factor(hr)21 33.505 22.027 1.521 0.128254
## as.factor(hr)22 19.421 22.067 0.880 0.378822
## as.factor(hr)23 5.565 22.301 0.250 0.802963
## as.factor(weathersit)2 -7.823 1.752 -4.464 8.10e-06 ***
## as.factor(weathersit)3 -54.424 3.076 -17.694 < 2e-16 ***
## as.factor(weathersit)4 -23.638 53.297 -0.444 0.657400
## poly(hum, degree = 8.7)1 957.555 586.675 1.632 0.102661
## poly(hum, degree = 8.7)2 -914.302 120.837 -7.566 4.03e-14 ***
## poly(hum, degree = 8.7)3 -177.031 98.917 -1.790 0.073521 .
## poly(hum, degree = 8.7)4 384.352 99.925 3.846 0.000120 ***
## poly(hum, degree = 8.7)5 -112.904 95.333 -1.184 0.236309
## poly(hum, degree = 8.7)6 -22.383 93.126 -0.240 0.810065
## poly(hum, degree = 8.7)7 64.497 92.906 0.694 0.487556
## poly(hum, degree = 8.7)8 -377.214 92.651 -4.071 4.69e-05 ***
## poly(temp, degree = 8)1 -2134.381 1133.604 -1.883 0.059741 .
## poly(temp, degree = 8)2 -3474.213 464.949 -7.472 8.26e-14 ***
## poly(temp, degree = 8)3 -2777.949 417.254 -6.658 2.87e-11 ***
## poly(temp, degree = 8)4 -457.491 270.972 -1.688 0.091365 .
## poly(temp, degree = 8)5 213.518 248.039 0.861 0.389347
## poly(temp, degree = 8)6 328.755 198.040 1.660 0.096925 .
## poly(temp, degree = 8)7 177.723 136.542 1.302 0.193071
## poly(temp, degree = 8)8 -55.473 124.113 -0.447 0.654912
## poly(atemp, degree = 8.9)1 -1740.835 1685.322 -1.033 0.301647
## poly(atemp, degree = 8.9)2 1224.997 461.324 2.655 0.007929 **
## poly(atemp, degree = 8.9)3 728.013 390.406 1.865 0.062233 .
## poly(atemp, degree = 8.9)4 -371.555 259.581 -1.431 0.152344
## poly(atemp, degree = 8.9)5 -294.540 246.098 -1.197 0.231384
## poly(atemp, degree = 8.9)6 8.645 195.419 0.044 0.964715
## poly(atemp, degree = 8.9)7 229.686 167.824 1.369 0.171138
## poly(atemp, degree = 8.9)8 59.002 147.736 0.399 0.689622
## atemp:season1 91.303 40.433 2.258 0.023949 *
## atemp:season2 137.184 37.128 3.695 0.000221 ***
## atemp:season3 -136.185 66.492 -2.048 0.040563 *
## atemp:season4 NA NA NA NA
## season1:hum 132.432 11.467 11.549 < 2e-16 ***
## season2:hum 30.183 11.509 2.623 0.008733 **
## season3:hum -56.517 14.683 -3.849 0.000119 ***
## season4:hum NA NA NA NA
## hum:yr1 -140.467 7.649 -18.363 < 2e-16 ***
## atemp:mnth2 -1.972 37.042 -0.053 0.957546
## atemp:mnth3 137.136 41.381 3.314 0.000922 ***
## atemp:mnth4 71.317 60.877 1.172 0.241412
## atemp:mnth5 46.588 69.212 0.673 0.500884
## atemp:mnth6 76.765 78.079 0.983 0.325535
## atemp:mnth7 172.625 92.254 1.871 0.061336 .
## atemp:mnth8 272.623 88.129 3.093 0.001981 **
## atemp:mnth9 411.659 80.779 5.096 3.50e-07 ***
## atemp:mnth10 357.439 61.564 5.806 6.51e-09 ***
## atemp:mnth11 288.204 60.364 4.774 1.82e-06 ***
## atemp:mnth12 285.572 50.310 5.676 1.40e-08 ***
## temp:hr0 -75.269 27.563 -2.731 0.006325 **
## temp:hr1 -123.094 27.696 -4.444 8.87e-06 ***
## temp:hr2 -139.435 28.057 -4.970 6.77e-07 ***
## temp:hr3 -153.613 28.643 -5.363 8.29e-08 ***
## temp:hr4 -157.870 28.775 -5.486 4.16e-08 ***
## temp:hr5 -138.378 28.590 -4.840 1.31e-06 ***
## temp:hr6 -27.234 28.181 -0.966 0.333858
## temp:hr7 179.174 27.123 6.606 4.06e-11 ***
## temp:hr8 220.824 26.350 8.380 < 2e-16 ***
## temp:hr9 65.043 26.066 2.495 0.012594 *
## temp:hr10 90.520 26.011 3.480 0.000502 ***
## temp:hr11 145.219 26.249 5.532 3.21e-08 ***
## temp:hr12 196.763 26.416 7.449 9.87e-14 ***
## temp:hr13 197.233 26.662 7.398 1.45e-13 ***
## temp:hr14 190.124 26.853 7.080 1.49e-12 ***
## temp:hr15 195.200 26.954 7.242 4.61e-13 ***
## temp:hr16 315.517 26.923 11.719 < 2e-16 ***
## temp:hr17 569.891 26.684 21.357 < 2e-16 ***
## temp:hr18 565.865 26.565 21.301 < 2e-16 ***
## temp:hr19 434.331 26.655 16.295 < 2e-16 ***
## temp:hr20 303.636 26.738 11.356 < 2e-16 ***
## temp:hr21 191.067 26.924 7.097 1.33e-12 ***
## temp:hr22 100.882 27.066 3.727 0.000194 ***
## temp:hr23 NA NA NA NA
## hum:hr1 13.718 30.007 0.457 0.647545
## hum:hr2 26.178 30.329 0.863 0.388074
## hum:hr3 42.385 31.456 1.347 0.177856
## hum:hr4 54.370 31.816 1.709 0.087483 .
## hum:hr5 35.891 31.645 1.134 0.256748
## hum:hr6 16.070 31.122 0.516 0.605617
## hum:hr7 -32.245 30.647 -1.052 0.292745
## hum:hr8 -62.430 30.252 -2.064 0.039064 *
## hum:hr9 -46.046 29.542 -1.559 0.119095
## hum:hr10 -87.280 29.232 -2.986 0.002833 **
## hum:hr11 -116.807 29.030 -4.024 5.75e-05 ***
## hum:hr12 -166.615 28.952 -5.755 8.82e-09 ***
## hum:hr13 -185.503 28.951 -6.408 1.52e-10 ***
## hum:hr14 -187.703 28.800 -6.518 7.35e-11 ***
## hum:hr15 -182.570 28.561 -6.392 1.68e-10 ***
## hum:hr16 -201.586 28.365 -7.107 1.23e-12 ***
## hum:hr17 -254.651 28.105 -9.061 < 2e-16 ***
## hum:hr18 -215.800 28.146 -7.667 1.85e-14 ***
## hum:hr19 -174.028 28.186 -6.174 6.80e-10 ***
## hum:hr20 -111.621 28.358 -3.936 8.31e-05 ***
## hum:hr21 -72.597 28.724 -2.527 0.011500 *
## hum:hr22 -41.303 28.889 -1.430 0.152820
## hum:hr23 -9.223 29.302 -0.315 0.752947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91.56 on 17249 degrees of freedom
## Multiple R-squared: 0.7471, Adjusted R-squared: 0.7452
## F-statistic: 395 on 129 and 17249 DF, p-value: < 2.2e-16
The linear regression model explains about 74.5% of the variability of the data. This value was already calculated in the section “Model development”.
gam.bike.1 <- gam(cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + s(hum) + s(temp) + s(atemp) + atemp:season + hum:season + hum:yr + atemp:mnth + temp:hr + hum:hr, data = d.bike)
summary(gam.bike.1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) +
## as.factor(weathersit) + s(hum) + s(temp) + s(atemp) + atemp:season +
## hum:season + hum:yr + atemp:mnth + temp:hr + hum:hr
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.398 16.243 -2.672 0.007553 **
## as.factor(season)2 74.999 17.194 4.362 1.30e-05 ***
## as.factor(season)3 293.571 38.505 7.624 2.58e-14 ***
## as.factor(season)4 195.233 16.481 11.846 < 2e-16 ***
## as.factor(yr)1 171.915 4.985 34.488 < 2e-16 ***
## as.factor(mnth)2 11.712 10.671 1.098 0.272415
## as.factor(mnth)3 -20.249 13.820 -1.465 0.142874
## as.factor(mnth)4 -10.974 23.573 -0.466 0.641566
## as.factor(mnth)5 19.163 30.338 0.632 0.527623
## as.factor(mnth)6 9.738 37.840 0.257 0.796904
## as.factor(mnth)7 -46.450 50.499 -0.920 0.357685
## as.factor(mnth)8 -117.640 47.115 -2.497 0.012539 *
## as.factor(mnth)9 -183.342 40.241 -4.556 5.25e-06 ***
## as.factor(mnth)10 -139.991 23.235 -6.025 1.72e-09 ***
## as.factor(mnth)11 -123.205 20.955 -5.880 4.19e-09 ***
## as.factor(mnth)12 -97.427 15.571 -6.257 4.02e-10 ***
## as.factor(hr)1 -7.863 22.719 -0.346 0.729288
## as.factor(hr)2 -20.528 22.883 -0.897 0.369683
## as.factor(hr)3 -37.976 23.592 -1.610 0.107486
## as.factor(hr)4 -51.447 23.989 -2.145 0.031995 *
## as.factor(hr)5 -32.069 23.623 -1.358 0.174623
## as.factor(hr)6 -8.429 23.330 -0.361 0.717871
## as.factor(hr)7 73.174 23.363 3.132 0.001739 **
## as.factor(hr)8 214.416 23.350 9.183 < 2e-16 ***
## as.factor(hr)9 130.314 22.802 5.715 1.12e-08 ***
## as.factor(hr)10 94.999 22.551 4.213 2.54e-05 ***
## as.factor(hr)11 111.927 22.270 5.026 5.06e-07 ***
## as.factor(hr)12 153.188 22.199 6.901 5.36e-12 ***
## as.factor(hr)13 157.793 22.238 7.096 1.34e-12 ***
## as.factor(hr)14 148.261 22.286 6.653 2.96e-11 ***
## as.factor(hr)15 153.379 22.270 6.887 5.89e-12 ***
## as.factor(hr)16 156.534 22.178 7.058 1.75e-12 ***
## as.factor(hr)17 198.021 22.045 8.982 < 2e-16 ***
## as.factor(hr)18 154.107 21.916 7.032 2.12e-12 ***
## as.factor(hr)19 96.895 21.859 4.433 9.36e-06 ***
## as.factor(hr)20 49.630 21.866 2.270 0.023237 *
## as.factor(hr)21 33.095 22.013 1.503 0.132740
## as.factor(hr)22 19.122 22.056 0.867 0.385964
## as.factor(hr)23 5.411 22.289 0.243 0.808201
## as.factor(weathersit)2 -7.856 1.751 -4.486 7.32e-06 ***
## as.factor(weathersit)3 -54.632 3.073 -17.780 < 2e-16 ***
## as.factor(weathersit)4 -25.399 53.270 -0.477 0.633514
## atemp:season1 64.728 27.234 2.377 0.017478 *
## atemp:season2 106.303 21.106 5.037 4.79e-07 ***
## atemp:season3 -162.859 43.682 -3.728 0.000193 ***
## atemp:season4 -27.225 29.585 -0.920 0.357465
## season1:hum 98.507 7.087 13.900 < 2e-16 ***
## season2:hum -3.243 7.487 -0.433 0.664897
## season3:hum -83.543 10.436 -8.005 1.27e-15 ***
## season4:hum -33.359 7.958 -4.192 2.78e-05 ***
## hum:yr1 -140.056 7.643 -18.324 < 2e-16 ***
## atemp:mnth2 -3.675 36.874 -0.100 0.920614
## atemp:mnth3 136.612 41.313 3.307 0.000946 ***
## atemp:mnth4 74.756 60.744 1.231 0.218461
## atemp:mnth5 51.090 69.040 0.740 0.459305
## atemp:mnth6 72.643 77.665 0.935 0.349630
## atemp:mnth7 181.354 91.866 1.974 0.048384 *
## atemp:mnth8 279.218 87.749 3.182 0.001465 **
## atemp:mnth9 413.787 80.584 5.135 2.85e-07 ***
## atemp:mnth10 362.589 61.593 5.887 4.01e-09 ***
## atemp:mnth11 282.847 60.284 4.692 2.73e-06 ***
## atemp:mnth12 278.221 50.276 5.534 3.18e-08 ***
## temp:hr0 -207.137 19.211 -10.782 < 2e-16 ***
## temp:hr1 -254.734 19.530 -13.043 < 2e-16 ***
## temp:hr2 -271.327 20.035 -13.543 < 2e-16 ***
## temp:hr3 -286.177 20.814 -13.749 < 2e-16 ***
## temp:hr4 -290.955 20.987 -13.864 < 2e-16 ***
## temp:hr5 -272.603 20.757 -13.133 < 2e-16 ***
## temp:hr6 -161.400 20.234 -7.977 1.60e-15 ***
## temp:hr7 46.172 18.804 2.455 0.014079 *
## temp:hr8 89.153 17.707 5.035 4.83e-07 ***
## temp:hr9 -66.912 17.214 -3.887 0.000102 ***
## temp:hr10 -42.833 16.954 -2.526 0.011532 *
## temp:hr11 11.794 17.131 0.688 0.491165
## temp:hr12 63.955 17.239 3.710 0.000208 ***
## temp:hr13 64.724 17.506 3.697 0.000219 ***
## temp:hr14 56.961 17.709 3.217 0.001300 **
## temp:hr15 61.397 17.820 3.445 0.000572 ***
## temp:hr16 181.538 17.811 10.193 < 2e-16 ***
## temp:hr17 436.159 17.556 24.844 < 2e-16 ***
## temp:hr18 431.602 17.491 24.675 < 2e-16 ***
## temp:hr19 301.448 17.769 16.965 < 2e-16 ***
## temp:hr20 171.746 18.000 9.541 < 2e-16 ***
## temp:hr21 59.350 18.371 3.231 0.001237 **
## temp:hr22 -30.904 18.631 -1.659 0.097192 .
## temp:hr23 -131.624 18.984 -6.933 4.26e-12 ***
## hum:hr1 14.003 29.987 0.467 0.640530
## hum:hr2 26.977 30.311 0.890 0.373486
## hum:hr3 43.368 31.434 1.380 0.167713
## hum:hr4 55.351 31.795 1.741 0.081728 .
## hum:hr5 37.101 31.620 1.173 0.240682
## hum:hr6 18.252 31.092 0.587 0.557190
## hum:hr7 -30.967 30.621 -1.011 0.311888
## hum:hr8 -60.825 30.227 -2.012 0.044204 *
## hum:hr9 -45.273 29.524 -1.533 0.125190
## hum:hr10 -86.223 29.215 -2.951 0.003168 **
## hum:hr11 -114.445 29.011 -3.945 8.02e-05 ***
## hum:hr12 -164.486 28.933 -5.685 1.33e-08 ***
## hum:hr13 -182.998 28.932 -6.325 2.59e-10 ***
## hum:hr14 -185.077 28.781 -6.431 1.30e-10 ***
## hum:hr15 -179.969 28.544 -6.305 2.95e-10 ***
## hum:hr16 -199.170 28.344 -7.027 2.19e-12 ***
## hum:hr17 -252.483 28.083 -8.991 < 2e-16 ***
## hum:hr18 -214.538 28.126 -7.628 2.51e-14 ***
## hum:hr19 -172.730 28.166 -6.132 8.84e-10 ***
## hum:hr20 -110.503 28.340 -3.899 9.69e-05 ***
## hum:hr21 -72.256 28.706 -2.517 0.011843 *
## hum:hr22 -40.665 28.876 -1.408 0.159067
## hum:hr23 -9.072 29.285 -0.310 0.756722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(hum) 8.801 8.973 10.864 <2e-16 ***
## s(temp) 6.954 7.939 16.889 <2e-16 ***
## s(atemp) 5.788 6.908 1.734 0.0933 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 133/136
## R-sq.(adj) = 0.745 Deviance explained = 74.7%
## GCV = 8437.1 Scale est. = 8375.2 n = 17379
The non-linear regression model shows practically the same result as the linear regression model: it too explains about 74.5% of the variability of the data.
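A Poisson regression models the logarithm of the expected count as a linear function of the predictors. A minimal self-contained illustration of the call on toy data (not the bike model; note that family = poisson must be passed explicitly, since glm() defaults to gaussian):

```r
set.seed(1)
## simulate counts from a known log-linear model: log E[y] = 1 + 2x
x <- runif(300)
y <- rpois(300, lambda = exp(1 + 2 * x))
## without family = poisson, glm() would silently fit a gaussian model
toy.poisson <- glm(y ~ x, family = poisson)
coef(toy.poisson)  # roughly (1, 2) on the log scale
```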
## family = poisson must be given explicitly; glm() defaults to gaussian
poi.bike.1 <- glm(final.model.1, family = poisson, data = d.bike)
summary(poi.bike.1)
##
## Call:
## glm(formula = final.model.1, data = d.bike)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -418.83 -51.19 -2.54 41.54 431.19
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -143.267 25.714 -5.572 2.56e-08 ***
## as.factor(season)2 73.324 17.205 4.262 2.04e-05 ***
## as.factor(season)3 298.104 38.588 7.725 1.18e-14 ***
## as.factor(season)4 195.091 16.508 11.818 < 2e-16 ***
## as.factor(yr)1 172.268 4.990 34.521 < 2e-16 ***
## as.factor(mnth)2 10.876 10.721 1.014 0.310367
## as.factor(mnth)3 -20.106 13.838 -1.453 0.146264
## as.factor(mnth)4 -9.261 23.624 -0.392 0.695051
## as.factor(mnth)5 21.410 30.430 0.704 0.481711
## as.factor(mnth)6 7.077 38.071 0.186 0.852537
## as.factor(mnth)7 -40.325 50.737 -0.795 0.426750
## as.factor(mnth)8 -113.138 47.319 -2.391 0.016815 *
## as.factor(mnth)9 -182.423 40.325 -4.524 6.11e-06 ***
## as.factor(mnth)10 -136.993 23.215 -5.901 3.68e-09 ***
## as.factor(mnth)11 -124.801 20.972 -5.951 2.72e-09 ***
## as.factor(mnth)12 -99.478 15.568 -6.390 1.70e-10 ***
## as.factor(hr)1 -7.430 22.747 -0.327 0.743961
## as.factor(hr)2 -20.040 22.900 -0.875 0.381518
## as.factor(hr)3 -37.708 23.616 -1.597 0.110357
## as.factor(hr)4 -51.305 24.012 -2.137 0.032644 *
## as.factor(hr)5 -32.269 23.650 -1.364 0.172442
## as.factor(hr)6 -7.857 23.362 -0.336 0.736631
## as.factor(hr)7 73.630 23.393 3.148 0.001649 **
## as.factor(hr)8 215.553 23.377 9.221 < 2e-16 ***
## as.factor(hr)9 130.726 22.817 5.729 1.02e-08 ***
## as.factor(hr)10 94.937 22.567 4.207 2.60e-05 ***
## as.factor(hr)11 112.811 22.286 5.062 4.19e-07 ***
## as.factor(hr)12 154.479 22.217 6.953 3.70e-12 ***
## as.factor(hr)13 159.391 22.255 7.162 8.27e-13 ***
## as.factor(hr)14 149.476 22.303 6.702 2.12e-11 ***
## as.factor(hr)15 154.276 22.290 6.921 4.63e-12 ***
## as.factor(hr)16 157.121 22.198 7.078 1.52e-12 ***
## as.factor(hr)17 198.580 22.063 9.001 < 2e-16 ***
## as.factor(hr)18 153.956 21.929 7.021 2.29e-12 ***
## as.factor(hr)19 97.181 21.871 4.443 8.91e-06 ***
## as.factor(hr)20 50.504 21.879 2.308 0.020994 *
## as.factor(hr)21 33.505 22.027 1.521 0.128254
## as.factor(hr)22 19.421 22.067 0.880 0.378822
## as.factor(hr)23 5.565 22.301 0.250 0.802963
## as.factor(weathersit)2 -7.823 1.752 -4.464 8.10e-06 ***
## as.factor(weathersit)3 -54.424 3.076 -17.694 < 2e-16 ***
## as.factor(weathersit)4 -23.638 53.297 -0.444 0.657400
## poly(hum, degree = 8.7)1 957.555 586.675 1.632 0.102661
## poly(hum, degree = 8.7)2 -914.302 120.837 -7.566 4.03e-14 ***
## poly(hum, degree = 8.7)3 -177.031 98.917 -1.790 0.073521 .
## poly(hum, degree = 8.7)4 384.352 99.925 3.846 0.000120 ***
## poly(hum, degree = 8.7)5 -112.904 95.333 -1.184 0.236309
## poly(hum, degree = 8.7)6 -22.383 93.126 -0.240 0.810065
## poly(hum, degree = 8.7)7 64.497 92.906 0.694 0.487556
## poly(hum, degree = 8.7)8 -377.214 92.651 -4.071 4.69e-05 ***
## poly(temp, degree = 8)1 -2134.381 1133.604 -1.883 0.059741 .
## poly(temp, degree = 8)2 -3474.213 464.949 -7.472 8.26e-14 ***
## poly(temp, degree = 8)3 -2777.949 417.254 -6.658 2.87e-11 ***
## poly(temp, degree = 8)4 -457.491 270.972 -1.688 0.091365 .
## poly(temp, degree = 8)5 213.518 248.039 0.861 0.389347
## poly(temp, degree = 8)6 328.755 198.040 1.660 0.096925 .
## poly(temp, degree = 8)7 177.723 136.542 1.302 0.193071
## poly(temp, degree = 8)8 -55.473 124.113 -0.447 0.654912
## poly(atemp, degree = 8.9)1 -1740.835 1685.322 -1.033 0.301647
## poly(atemp, degree = 8.9)2 1224.997 461.324 2.655 0.007929 **
## poly(atemp, degree = 8.9)3 728.013 390.406 1.865 0.062233 .
## poly(atemp, degree = 8.9)4 -371.555 259.581 -1.431 0.152344
## poly(atemp, degree = 8.9)5 -294.540 246.098 -1.197 0.231384
## poly(atemp, degree = 8.9)6 8.645 195.419 0.044 0.964715
## poly(atemp, degree = 8.9)7 229.686 167.824 1.369 0.171138
## poly(atemp, degree = 8.9)8 59.002 147.736 0.399 0.689622
## atemp:season1 91.303 40.433 2.258 0.023949 *
## atemp:season2 137.184 37.128 3.695 0.000221 ***
## atemp:season3 -136.185 66.492 -2.048 0.040563 *
## atemp:season4 NA NA NA NA
## season1:hum 132.432 11.467 11.549 < 2e-16 ***
## season2:hum 30.183 11.509 2.623 0.008733 **
## season3:hum -56.517 14.683 -3.849 0.000119 ***
## season4:hum NA NA NA NA
## hum:yr1 -140.467 7.649 -18.363 < 2e-16 ***
## atemp:mnth2 -1.972 37.042 -0.053 0.957546
## atemp:mnth3 137.136 41.381 3.314 0.000922 ***
## atemp:mnth4 71.317 60.877 1.172 0.241412
## atemp:mnth5 46.588 69.212 0.673 0.500884
## atemp:mnth6 76.765 78.079 0.983 0.325535
## atemp:mnth7 172.625 92.254 1.871 0.061336 .
## atemp:mnth8 272.623 88.129 3.093 0.001981 **
## atemp:mnth9 411.659 80.779 5.096 3.50e-07 ***
## atemp:mnth10 357.439 61.564 5.806 6.51e-09 ***
## atemp:mnth11 288.204 60.364 4.774 1.82e-06 ***
## atemp:mnth12 285.572 50.310 5.676 1.40e-08 ***
## temp:hr0 -75.269 27.563 -2.731 0.006325 **
## temp:hr1 -123.094 27.696 -4.444 8.87e-06 ***
## temp:hr2 -139.435 28.057 -4.970 6.77e-07 ***
## temp:hr3 -153.613 28.643 -5.363 8.29e-08 ***
## temp:hr4 -157.870 28.775 -5.486 4.16e-08 ***
## temp:hr5 -138.378 28.590 -4.840 1.31e-06 ***
## temp:hr6 -27.234 28.181 -0.966 0.333858
## temp:hr7 179.174 27.123 6.606 4.06e-11 ***
## temp:hr8 220.824 26.350 8.380 < 2e-16 ***
## temp:hr9 65.043 26.066 2.495 0.012594 *
## temp:hr10 90.520 26.011 3.480 0.000502 ***
## temp:hr11 145.219 26.249 5.532 3.21e-08 ***
## temp:hr12 196.763 26.416 7.449 9.87e-14 ***
## temp:hr13 197.233 26.662 7.398 1.45e-13 ***
## temp:hr14 190.124 26.853 7.080 1.49e-12 ***
## temp:hr15 195.200 26.954 7.242 4.61e-13 ***
## temp:hr16 315.517 26.923 11.719 < 2e-16 ***
## temp:hr17 569.891 26.684 21.357 < 2e-16 ***
## temp:hr18 565.865 26.565 21.301 < 2e-16 ***
## temp:hr19 434.331 26.655 16.295 < 2e-16 ***
## temp:hr20 303.636 26.738 11.356 < 2e-16 ***
## temp:hr21 191.067 26.924 7.097 1.33e-12 ***
## temp:hr22 100.882 27.066 3.727 0.000194 ***
## temp:hr23 NA NA NA NA
## hum:hr1 13.718 30.007 0.457 0.647545
## hum:hr2 26.178 30.329 0.863 0.388074
## hum:hr3 42.385 31.456 1.347 0.177856
## hum:hr4 54.370 31.816 1.709 0.087483 .
## hum:hr5 35.891 31.645 1.134 0.256748
## hum:hr6 16.070 31.122 0.516 0.605617
## hum:hr7 -32.245 30.647 -1.052 0.292745
## hum:hr8 -62.430 30.252 -2.064 0.039064 *
## hum:hr9 -46.046 29.542 -1.559 0.119095
## hum:hr10 -87.280 29.232 -2.986 0.002833 **
## hum:hr11 -116.807 29.030 -4.024 5.75e-05 ***
## hum:hr12 -166.615 28.952 -5.755 8.82e-09 ***
## hum:hr13 -185.503 28.951 -6.408 1.52e-10 ***
## hum:hr14 -187.703 28.800 -6.518 7.35e-11 ***
## hum:hr15 -182.570 28.561 -6.392 1.68e-10 ***
## hum:hr16 -201.586 28.365 -7.107 1.23e-12 ***
## hum:hr17 -254.651 28.105 -9.061 < 2e-16 ***
## hum:hr18 -215.800 28.146 -7.667 1.85e-14 ***
## hum:hr19 -174.028 28.186 -6.174 6.80e-10 ***
## hum:hr20 -111.621 28.358 -3.936 8.31e-05 ***
## hum:hr21 -72.597 28.724 -2.527 0.011500 *
## hum:hr22 -41.303 28.889 -1.430 0.152820
## hum:hr23 -9.223 29.302 -0.315 0.752947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 8383.607)
##
## Null deviance: 571761591 on 17378 degrees of freedom
## Residual deviance: 144608842 on 17249 degrees of freedom
## AIC: 206453
##
## Number of Fisher Scoring iterations: 2
To enable a meaningful comparison between the three models, each model is trained ten times on a fresh training set and then used to predict the held-out data. Finally, the average R-squared of the three models is compared to determine the best model.
r.squared.lm.bike.1 <- numeric(10)
r.squared.gam.bike.1 <- numeric(10)
r.squared.poi.bike.1 <- numeric(10)
for(i in 1:10){
d.bike.train.id <- sample(seq_len(nrow(d.bike)), size = floor(0.75*nrow(d.bike)))
d.bike.train <- d.bike[d.bike.train.id,]
d.bike.test <- d.bike[-d.bike.train.id,]
#predict data with linear model
lm.bike.1.train <- lm(final.model.1, data = d.bike.train)
predicted.lm.bike.1.test <- predict(lm.bike.1.train, newdata = d.bike.test)
r.squared.lm.bike.1[i] <- cor(predicted.lm.bike.1.test, d.bike.test$cnt)^2
#predict data with non-linear model
gam.bike.1.train <- gam(cnt ~ as.factor(season) + as.factor(yr) + as.factor(mnth) + as.factor(hr) + as.factor(weathersit) + s(hum) + s(temp) + s(atemp) + atemp:season + hum:season + hum:yr + atemp:mnth + temp:hr + hum:hr, data = d.bike.train)
predicted.gam.bike.1.test <- predict(gam.bike.1.train, newdata = d.bike.test)
r.squared.gam.bike.1[i] <- cor(predicted.gam.bike.1.test, d.bike.test$cnt)^2
#predict data with poisson model (fitted on the training set only)
poi.bike.1.train <- glm(final.model.1, data = d.bike.train)
predicted.poi.bike.1.test <- predict(poi.bike.1.train, newdata = d.bike.test)
r.squared.poi.bike.1[i] <- cor(predicted.poi.bike.1.test, d.bike.test$cnt)^2
}
mean(r.squared.lm.bike.1)
## [1] 0.7286108
mean(r.squared.gam.bike.1)
## [1] 0.728941
mean(r.squared.poi.bike.1)
## [1] 0.7328381
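The manual resampling loop above could also be replaced by caret's built-in repeated cross-validation. A minimal sketch for the linear model, assuming the formula final.model.1 from above and the caret package loaded in the setup:

```r
# Repeated 4-fold cross-validation of the linear model with caret;
# Rsquared is averaged over all 40 resamples automatically
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 10)
lm.bike.cv <- train(final.model.1, data = d.bike,
                    method = "lm", trControl = ctrl)
lm.bike.cv$results$Rsquared
```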
The results are very close. The Poisson regression model achieves the nominally best average R-squared of about 0.733, but the difference between the three models is negligible.
In this chapter a classification tree and a regression tree are used to predict cnt - once as a class and once as a numeric value. Afterwards, the regression tree model serves as the basis for the bagging, random forest and boosting procedures.
To be able to predict a class using the classification tree, a class variable must first be created. For this purpose five classes are formed. The classes indicate whether few (0) or many (4) bicycles were rented during the observed hour. The classes are formed according to a simple principle: the maximum observed value of cnt is divided by 5, yielding five equally wide intervals. Finally, the training and test data sets are initialized with the classified data.
max(d.bike.new$cnt)
## [1] 977
max(d.bike.new$cnt)/5
## [1] 195.4
for(i in 1:nrow(d.bike.new)){
if(d.bike.new$cnt[i] >= 0 & d.bike.new$cnt[i] < 196){
d.bike.new$class[i] <- 0
} else if (d.bike.new$cnt[i] >= 196 & d.bike.new$cnt[i] < 391){
d.bike.new$class[i] <- 1
} else if (d.bike.new$cnt[i] >= 391 & d.bike.new$cnt[i] < 586){
d.bike.new$class[i] <- 2
} else if (d.bike.new$cnt[i] >= 586 & d.bike.new$cnt[i] < 781){
d.bike.new$class[i] <- 3
} else if (d.bike.new$cnt[i] >= 781 & d.bike.new$cnt[i] <= 977){
d.bike.new$class[i] <- 4
}
}
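The for-loop above can be expressed more compactly with base R's cut(); a sketch producing the same class labels 0-4, assuming the same interval bounds:

```r
# Five equal-width, left-closed intervals [0,196), [196,391), ..., [781,978)
d.bike.new$class <- as.integer(cut(d.bike.new$cnt,
                                   breaks = c(0, 196, 391, 586, 781, 978),
                                   right = FALSE)) - 1
```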
d.bike.train.id <- sample(seq_len(nrow(d.bike.new)),size = floor(0.75*nrow(d.bike.new)))
d.bike.train.new <- d.bike.new[d.bike.train.id,]
d.bike.test.new <- d.bike.new[-d.bike.train.id,]
On the basis of the five defined classes, a classification tree is now fitted on the training data set. The response class is passed to the model as a factor, which makes this a classification tree. cnt is removed from the model because, by construction, it determines class completely.
tree.classification.bike.1 <- tree(as.factor(class) ~.-cnt, data = d.bike.train.new)
summary(tree.classification.bike.1)
##
## Classification tree:
## tree(formula = as.factor(class) ~ . - cnt, data = d.bike.train.new)
## Variables actually used in tree construction:
## [1] "hr" "temp" "yr"
## Number of terminal nodes: 10
## Residual mean deviance: 1.245 = 16210 / 13020
## Misclassification error rate: 0.2774 = 3616 / 13034
The result shows the structure of the classification tree: there are 10 terminal nodes, and not all variables were used to build the tree - only hr, temp and yr. Next, the tree is plotted.
plot(tree.classification.bike.1)
text(tree.classification.bike.1, pretty=1, cex=0.75)
As can be seen, the attribute hr is the root and therefore the most important attribute. As the summary has already shown, only hr, yr and temp are used for the classification. Furthermore, there are only 10 leaves or end nodes.
Now a prediction is made on the training data so that the hit and error rates can be calculated.
tree.classification.bike.pred.train <- predict(tree.classification.bike.1, d.bike.train.new, type="class")
(tree.classification.bike.pred.train.ct <- table(tree.classification.bike.pred.train, as.factor(d.bike.train.new$class)))
##
## tree.classification.bike.pred.train 0 1 2 3 4
## 0 7587 1684 227 5 0
## 1 279 1340 624 124 0
## 2 107 146 314 112 4
## 3 38 38 89 177 139
## 4 0 0 0 0 0
#correctly classified observations lie on the diagonal of the confusion table
(tree.classification.bike.pred.train.rate <- sum(diag(tree.classification.bike.pred.train.ct))/sum(tree.classification.bike.pred.train.ct))
## [1] 0.7225717
(tree.classification.bike.pred.train.error <- 1 - tree.classification.bike.pred.train.rate)
## [1] 0.2774283
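The hit rate can also be read directly from caret's confusionMatrix; a sketch, assuming the caret package from the setup and matching factor levels between prediction and reference:

```r
# Accuracy and per-class statistics in one call
cm <- caret::confusionMatrix(
  data      = tree.classification.bike.pred.train,
  reference = factor(d.bike.train.new$class,
                     levels = levels(tree.classification.bike.pred.train)))
cm$overall["Accuracy"]
```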
The table shows that the classification makes many errors; for example, observations are assigned to class 0 although they belong to other classes. The hit and error rates confirm this: roughly 70% of the observations are classified correctly and roughly 30% incorrectly. What does the result look like on the test data set?
tree.classification.bike.pred.test <- predict(tree.classification.bike.1, d.bike.test.new, type="class")
(tree.classification.bike.pred.test.ct <- table(tree.classification.bike.pred.test, as.factor(d.bike.test.new$class)))
##
## tree.classification.bike.pred.test 0 1 2 3 4
## 0 2502 624 66 0 0
## 1 80 441 197 36 0
## 2 36 54 105 39 3
## 3 13 14 34 69 32
## 4 0 0 0 0 0
(tree.classification.bike.pred.test.rate <- sum(diag(tree.classification.bike.pred.test.ct))/sum(tree.classification.bike.pred.test.ct))
## [1] 0.7173763
(tree.classification.bike.pred.test.error <- 1 - tree.classification.bike.pred.test.rate)
## [1] 0.2826237
The result is basically the same. The test data set is also classified with a hit rate of 70% and an error rate of 30%. Pruning could possibly improve this result.
tree.classification.bike.pruning <- cv.tree(tree.classification.bike.1, FUN = prune.misclass)
plot(tree.classification.bike.pruning$size, tree.classification.bike.pruning$dev, type="b")
plot(tree.classification.bike.pruning$k, tree.classification.bike.pruning$dev, type="b")
The graph shows that the error curve flattens from a size of about 7, so 7 is chosen for the parameter "best". Now the pruning can be performed.
prune.tree.classification.bike <- prune.misclass(tree.classification.bike.1, best=7)
summary(prune.tree.classification.bike)
##
## Classification tree:
## snip.tree(tree = tree.classification.bike.1, nodes = c(2L, 6L
## ))
## Variables actually used in tree construction:
## [1] "hr" "temp" "yr"
## Number of terminal nodes: 7
## Residual mean deviance: 1.333 = 17360 / 13030
## Misclassification error rate: 0.2787 = 3632 / 13034
This procedure enabled the tree to be pruned by three leaves. What does the pruned tree look like visually?
plot(prune.tree.classification.bike)
text(prune.tree.classification.bike,pretty=0)
hr is still the most important attribute and therefore the root of the tree. The left branch of the initial tree now disappears completely. The right branches could also be reduced. Now the performance prediction on the test data set is to be checked, so that a comparison with the initial tree can be made.
prune.tree.classification.bike.pred.test <- predict(prune.tree.classification.bike, d.bike.test.new, type="class")
(prune.tree.classification.bike.pred.test.ct <- table(prune.tree.classification.bike.pred.test, as.factor(d.bike.test.new$class)))
##
## prune.tree.classification.bike.pred.test 0 1 2 3 4
## 0 2536 645 101 21 3
## 1 80 441 197 36 0
## 2 2 33 70 18 0
## 3 13 14 34 69 32
## 4 0 0 0 0 0
(prune.tree.classification.bike.pred.test.correct <- sum(prune.tree.classification.bike.pred.test==as.factor(d.bike.test.new$class))/sum(prune.tree.classification.bike.pred.test.ct))
## [1] 0.7171461
(prune.tree.classification.bike.pred.test.error <- 1 - prune.tree.classification.bike.pred.test.correct)
## [1] 0.2828539
tree.classification.bike.pred.test.rate
## [1] 0.7173763
tree.classification.bike.pred.test.error
## [1] 0.2826237
The table as well as the hit and error rates show that even the pruned tree still makes many classification errors. The difference to the unpruned tree - as the two values above show - is minimal.
The count values are not bell-shaped (normally) distributed:
hist(d.bike.train.new$cnt)
In this section the variable cnt itself is to be predicted. Treating cnt as a factor is considered first:
table(d.bike.train.new$cnt)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 105 151 168 167 205 179 149 142 93 122 106 88 93 76 64 82 82 62 50 68
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 72 51 61 58 52 72 55 59 44 45 55 54 45 47 49 48 41 39 43 47
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 41 44 35 38 38 37 40 30 28 45 35 39 33 34 36 34 35 25 36 31
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 27 41 32 51 30 28 24 32 42 33 39 37 26 25 40 28 24 43 39 26
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 25 22 29 40 25 41 34 32 30 46 33 35 33 30 37 32 27 28 30 19
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 24 33 28 31 30 37 27 41 27 29 24 32 33 40 23 34 28 39 30 26
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 31 31 41 34 29 32 30 22 36 28 22 18 30 29 37 29 27 28 26 30
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 34 27 23 23 29 22 32 29 23 29 32 33 38 35 23 26 31 20 32 22
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 26 29 32 23 35 24 26 29 27 28 29 30 20 29 27 22 33 29 25 26
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 32 33 20 23 33 19 26 30 27 37 20 23 21 26 25 25 23 24 16 17
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 25 33 27 27 26 26 28 25 26 21 28 17 21 22 16 19 25 25 26 24
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 27 24 23 30 21 24 22 25 29 26 12 29 32 18 22 17 26 24 21 19
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 20 13 23 20 20 13 14 21 15 17 21 12 19 22 14 22 19 14 17 24
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 16 11 23 21 12 14 20 19 15 24 15 21 14 26 19 25 9 12 13 20
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 20 20 22 11 26 19 15 13 14 15 17 16 10 15 11 15 18 15 20 24
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 14 17 15 17 14 14 10 24 10 15 13 14 24 17 16 18 13 10 8 13
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 15 12 16 6 11 11 13 26 9 14 20 19 10 14 10 11 16 15 6 11
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 11 14 17 9 8 10 13 5 12 7 11 10 11 9 13 12 11 12 13 15
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 11 13 13 5 10 6 16 5 10 8 10 10 14 16 8 14 14 11 8 13
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 11 13 12 8 11 7 12 8 9 10 9 8 3 9 8 10 9 13 7 8
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 10 15 6 16 8 9 9 10 9 16 7 7 10 9 6 11 7 8 8 12
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 14 3 8 7 9 5 9 10 2 4 8 12 6 6 8 7 5 4 5 6
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 8 6 6 4 7 9 8 5 8 7 4 13 7 7 12 11 9 6 7 6
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 5 9 8 10 6 10 7 8 6 9 7 9 7 5 4 6 1 7 9 5
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 8 11 8 2 3 7 8 5 7 4 10 11 9 8 11 5 9 6 10 6
## 501 502 503 504 505 506 507 508 509 511 512 513 514 515 516 517 518 519 520 521
## 5 6 7 7 9 2 6 9 6 4 5 8 5 4 6 6 4 3 5 7
## 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541
## 4 3 3 6 5 5 5 5 7 8 2 1 2 2 6 4 3 8 5 7
## 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561
## 4 6 4 2 6 8 1 3 5 6 5 4 3 7 8 3 5 6 7 2
## 562 563 564 565 566 567 568 569 570 571 572 573 575 576 577 578 579 580 581 582
## 4 7 4 6 3 5 8 6 6 3 3 5 1 5 2 4 6 2 4 6
## 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602
## 2 4 7 5 1 2 3 7 10 2 4 5 2 7 2 2 2 1 3 2
## 603 604 605 606 607 608 609 610 611 612 614 615 616 617 618 619 620 621 622 623
## 2 3 4 2 4 3 2 4 2 2 6 3 2 4 2 1 2 1 3 3
## 624 625 626 627 628 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646
## 3 2 6 6 4 3 2 3 4 1 1 5 2 4 4 3 4 4 2 6
## 647 648 649 650 651 652 653 654 655 656 657 658 659 660 662 663 664 665 666 667
## 3 2 4 1 4 1 5 7 2 2 2 2 2 1 4 2 2 3 2 2
## 668 669 670 671 672 673 674 675 676 678 679 680 681 682 683 684 685 686 687 688
## 6 1 1 4 2 3 2 1 1 3 3 3 4 2 3 1 1 5 2 2
## 689 690 691 692 693 694 696 697 698 699 700 702 703 704 705 706 707 708 709 710
## 2 2 5 2 4 3 1 1 2 2 2 2 2 2 3 1 1 1 1 3
## 711 712 713 714 715 717 719 721 722 723 724 727 729 730 731 732 733 734 736 737
## 4 2 4 1 3 1 2 2 2 3 3 1 4 2 3 2 1 1 1 2
## 738 740 741 743 744 745 747 748 749 750 754 755 757 758 759 760 761 766 767 769
## 1 1 2 3 3 2 1 1 1 4 1 1 3 1 2 2 1 2 1 1
## 770 771 772 774 775 776 777 779 781 783 784 785 786 788 790 791 792 794 795 797
## 3 1 2 1 1 1 1 1 2 1 2 2 1 2 2 2 1 2 2 1
## 800 801 804 805 806 808 809 810 811 812 813 814 817 818 819 820 822 823 824 825
## 2 4 1 2 1 4 2 2 2 6 2 2 1 2 2 2 3 2 1 1
## 827 830 832 833 834 835 837 838 839 843 844 845 846 849 851 852 853 854 856 857
## 3 1 1 1 2 3 2 2 4 2 2 1 2 1 1 1 2 1 1 2
## 858 862 864 865 868 869 870 872 873 877 878 884 886 888 890 891 892 893 894 897
## 3 1 1 1 1 1 1 2 1 2 1 2 2 1 1 2 1 1 1 1
## 898 900 905 917 922 925 938 941 943 948 953 963 967 968 970 976 977
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
As a count variable, cnt could in principle be treated as categorical, which would normally call for a classification tree. But since it has more than 32 levels, a regression tree is generated instead.
Proof: classification does not work with cnt because tree() supports at most 32 factor levels, while cnt has several hundred distinct values. The following code would therefore fail:
#tree.classification.bike <- tree(as.factor(cnt) ~ .-class, data = d.bike.train.new)
#tree.classification.bike.pred <- predict(tree.classification.bike, d.bike.train.new, type="class")
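The level count can be verified directly; a small check on the training data:

```r
# Number of distinct cnt values -- far above tree()'s limit of
# 32 factor levels for a categorical response
length(unique(d.bike.train.new$cnt))
```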
The regression tree is generated:
tree.regression.bike <- tree(cnt ~ .-class, data = d.bike.train.new)
plot(tree.regression.bike)
text(tree.regression.bike, pretty=1, cex=0.75)
The tree has 10 terminal nodes, as the summary shows:
summary(tree.regression.bike)
##
## Regression tree:
## tree(formula = cnt ~ . - class, data = d.bike.train.new)
## Variables actually used in tree construction:
## [1] "hr" "temp" "yr"
## Number of terminal nodes: 10
## Residual mean deviance: 11230 = 146300000 / 13020
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -554.50 -50.47 -17.24 0.00 53.76 514.80
The regression tree is now used for prediction, first on the training data:
tree.regression.bike.train.pred <- predict(tree.regression.bike, d.bike.train.new, type="vector")
The predictions of the regression tree are compared with the true cnt values. The plot shows a positive correlation:
plot(tree.regression.bike.train.pred,d.bike.train.new$cnt)
abline (0 ,1) # compare with the function f(x)=x (intercept 0, slope 1)
The residuals are calculated (predicted cnt values minus observed cnt values):
error <- tree.regression.bike.train.pred-d.bike.train.new$cnt
element_ID <- 1:length(error)
Analysis of the residuals: the majority of the residuals lie within ±200, see graphic:
plot(element_ID,error)
title(main="Analysis of the residuals")
abline(0 ,0, lwd=5, col="skyblue", lty="dotted")
abline(200 ,0, lwd=5, col="red", lty="dotted")
abline(-200 ,0, lwd=5, col="red", lty="dotted")
Histogram of the errors: most errors do not exceed ±200 (as already seen in the residual analysis above), and their distribution appears bell-shaped. See graphic:
hist(error)
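A normal Q-Q plot gives a sharper check of the bell shape than the histogram; a sketch using the error vector computed above:

```r
# Points lying close to the reference line indicate approximately
# normally distributed residuals
qqnorm(error)
qqline(error)
```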
Some key figures on the training data - RSS: 146263805, MSE: 11221.71, standard deviation (RMSE): 105.9326 - calculated with the following code:
(RSS <- sum((d.bike.train.new["cnt"]-tree.regression.bike.train.pred)^2))
## [1] 146263805
(MSE <- mean(((d.bike.train.new["cnt"]-tree.regression.bike.train.pred)^2)$cnt))
## [1] 11221.71
(deviation <- sqrt(MSE))
## [1] 105.9326
A prediction is now made on the test data:
tree.regression.bike.test.pred <- predict(tree.regression.bike, d.bike.test.new, type="vector")
MSE: the MSE values of the training and test data are compared. Both values are very close to each other, which indicates that the tree generalizes well. MSE of the training data: 11221.71; MSE of the test data: 10638.24.
(MSE.train <- mean(((d.bike.train.new["cnt"]-tree.regression.bike.train.pred)^2)$cnt))
## [1] 11221.71
(MSE.test <- mean(((d.bike.test.new["cnt"]-tree.regression.bike.test.pred)^2)$cnt))
## [1] 10638.24
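The repeated MSE expressions can be wrapped in a small helper function; a sketch (the function name mse is our own choice):

```r
# Mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

# usage with the objects from above:
# mse(d.bike.train.new$cnt, tree.regression.bike.train.pred)  # training MSE
# mse(d.bike.test.new$cnt,  tree.regression.bike.test.pred)   # test MSE
```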
Graphical comparison of the error residuals between training and test data: as the graph shows, the error distributions of training and test data look alike, which again indicates good performance of the decision tree. See graphic:
errors.2.in <- predict(tree.regression.bike, d.bike.train.new, type="vector")-d.bike.train.new$cnt
element.2.in <- as.integer(names(errors.2.in))
errors.2.in_dataframe <- tibble(element.2.in,errors.2.in,"TRAIN")
colnames(errors.2.in_dataframe) <- c('ID','error','type')
errors.2 <- predict(tree.regression.bike, d.bike.test.new, type="vector")-d.bike.test.new$cnt
element.2 <- as.integer(names(errors.2))
errors.2.out_dataframe <- tibble(element.2,errors.2,"TEST")
colnames(errors.2.out_dataframe) <- c('ID','error','type')
errors.2_dataframe <- bind_rows(errors.2.in_dataframe,errors.2.out_dataframe)
errors.2_dataframe <- arrange(errors.2_dataframe, ID)
ggplot(data = errors.2_dataframe, mapping = aes(x = ID,y = error, color = type)) +
geom_point() + geom_boxplot(alpha = 0.5)
Pruning of the regression tree: the average deviance is influenced by the number of leaves. Up to four leaves the deviance decreases significantly; from then on it becomes stable. See graphic:
tree.regression.bike.pruning = cv.tree(tree.regression.bike, FUN = prune.tree)
plot(tree.regression.bike.pruning)
Cross-validation error rate as a function of size and k:
In the graphic below one can see again that the curve flattens out from the 4th node on:
par(mfrow=c(1,2))
plot(tree.regression.bike.pruning$size, tree.regression.bike.pruning$dev, type="b")
In this graphic one can see that the curve flattens out after k ≈ 5.0e+07:
plot(tree.regression.bike.pruning$k, tree.regression.bike.pruning$dev, type="b")
par(mfrow=c(1,1))
The number of leaves should be chosen at this level.
Based on the findings above, the tree was pruned once to 4 and once to 5 leaves. With the five-leaf tree, the goal was to find out whether the expected lower mean deviance would yield a better tree. Result: both trees (generated below) are good; neither contains unnecessary decisions. See the graphics of both trees:
Tree with 5 leaves:
tree.regression.bike.pruned <- prune.tree(tree.regression.bike, best = 5)
plot(tree.regression.bike.pruned)
text(tree.regression.bike.pruned, pretty=1, cex=0.75)
Tree with 4 leaves:
tree.regression.bike.pruned2 <- prune.tree(tree.regression.bike, best = 4)
plot(tree.regression.bike.pruned2)
text(tree.regression.bike.pruned2, pretty=1, cex=0.75)
Summary of both trees: the mean deviance of the tree with 5 leaves ("tree.regression.bike.pruned") is 14790; the mean deviance of the tree with 4 leaves ("tree.regression.bike.pruned2") is 16100. For the remainder of this section, the tree with five leaves is preferred, since it has the lower mean deviance. Side note: the summaries also show that after pruning, only the variables hr and temp are used in both trees. Summary of the tree with 5 leaves:
summary(tree.regression.bike.pruned)
##
## Regression tree:
## snip.tree(tree = tree.regression.bike, nodes = c(15L, 6L, 14L
## ))
## Variables actually used in tree construction:
## [1] "hr" "temp"
## Number of terminal nodes: 5
## Residual mean deviance: 14790 = 192700000 / 13030
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -544.70 -64.02 -20.24 0.00 50.76 635.70
Summary of the tree with 4 leaves:
summary(tree.regression.bike.pruned2)
##
## Regression tree:
## snip.tree(tree = tree.regression.bike, nodes = c(15L, 6L, 14L,
## 2L))
## Variables actually used in tree construction:
## [1] "hr" "temp"
## Number of terminal nodes: 4
## Residual mean deviance: 16100 = 209800000 / 13030
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -544.70 -68.60 -31.12 0.00 60.37 635.70
Next, prediction tests were run on the pruned tree with five leaves. Result: the MSE is 14785.88 on the training data and 14245.44 on the test data. These two values are very close to each other, which again indicates that the pruned tree generalizes well.
Prediction using pruned tree based on training data:
tree.regression.bike.pruned.train.pred <- predict(tree.regression.bike.pruned, d.bike.train.new, type="vector")
(MSE.pruned.train <- mean(((d.bike.train.new["cnt"]-tree.regression.bike.pruned.train.pred)^2)$cnt))
## [1] 14785.88
Prediction using pruned tree based on test data:
tree.regression.bike.pruned.test.pred <- predict(tree.regression.bike.pruned, d.bike.test.new, type="vector")
(MSE.pruned.test <- mean(((d.bike.test.new["cnt"]-tree.regression.bike.pruned.test.pred)^2)$cnt))
## [1] 14245.44
In this section, bagging is performed on the regression tree model to reduce the variance. The y-variable is not class but cnt. The number of bootstrap replications is set to 25, and coob = TRUE requests an out-of-bag estimate of the error instead of cross-validation.
bag.bike=bagging(cnt~.-class, data=d.bike.train.new, nbagg=25, coob =TRUE)
print(bag.bike)
##
## Bagging regression trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = cnt ~ . - class, data = d.bike.train.new,
## nbagg = 25, coob = TRUE)
##
## Out-of-bag estimate of root mean squared error: 51.1638
The out-of-bag root mean squared error (RMSE) is 51.1638; the corresponding mean squared error (MSE) is therefore about 2618.
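The conversion from RMSE to MSE is simple squaring:

```r
# out-of-bag RMSE reported by bagging() above
rmse.oob <- 51.1638
(mse.oob <- rmse.oob^2)
```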
Now prediction is to be performed by means of bagging on the test set. The result is plotted.
yhat.bag = predict(bag.bike,newdata=d.bike.test.new)
plot(yhat.bag, d.bike.test.new$cnt)
abline(0,1)
The graphical representation shows what the MSE already indicated: the deviation of the predictions from the ideal straight line, and thus the variance, is clearly visible.
mean((yhat.bag-d.bike.test.new$cnt)^2)
## [1] 2652.006
The calculation of the MSE from the forecast confirms this a second time.
In this section the random forest method is used for the same analysis. However, here the class variable is used as the target value instead of cnt: the random forest function cannot handle cnt because it has too many levels (see https://stats.stackexchange.com/questions/37370/random-forest-computing-time-in-r). mtry is set to 6 for this analysis.
The aim of this analysis is to calculate an MSE value again and to investigate the importance of the individual explaining variables.
Run the random forest:
rf.bike=randomForest(as.numeric(as.character(class)) ~ .-cnt,data=d.bike.train.new, mtry=6,importance =TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
Predict on the test set:
yhat.rf = predict(rf.bike ,newdata=d.bike.test.new)
Performance (MSE) result: the MSE is about 0.226.
mean((yhat.rf-d.bike.test.new$class)^2)
## [1] 0.2264341
The two measures %IncMSE ("increase in MSE") and IncNodePurity ("increase in node purity") show that hr is by far the most important variable, as the table and graphic below illustrate:
importance(rf.bike)
## %IncMSE IncNodePurity
## season 64.31021 286.7481
## yr 229.76703 1028.8263
## mnth 73.34177 574.5719
## hr 504.17684 4329.8716
## weathersit 71.58506 277.5074
## temp 58.55109 1087.6280
## atemp 49.33936 941.7789
## hum 80.86539 889.7663
varImpPlot (rf.bike)
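The choice mtry = 6 was not tuned systematically; randomForest::tuneRF offers a quick check. A sketch, assuming the same training data with class as numeric response and cnt/class removed from the predictors:

```r
set.seed(1)  # reproducible bootstrap samples
x.rf <- subset(d.bike.train.new, select = -c(cnt, class))
y.rf <- as.numeric(as.character(d.bike.train.new$class))
# starts from the default mtry and doubles/halves it (stepFactor = 2),
# keeping a step while the OOB error improves by at least 5 %
tuneRF(x.rf, y.rf, ntreeTry = 100, stepFactor = 2, improve = 0.05)
```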
In this section Boosting will be used to improve the model. Since our data set is count data, we choose the Poisson distribution. The number of trees to be generated in boosting is set to 1000.
boost.bike=gbm(cnt~.-class,data=d.bike.train.new,
distribution="poisson",n.trees=1000, interaction.depth=4)
summary(boost.bike)
## var rel.inf
## hr hr 57.990601
## temp temp 10.750827
## yr yr 10.612623
## mnth mnth 5.998318
## atemp atemp 5.162263
## hum hum 4.030658
## season season 3.391951
## weathersit weathersit 2.062759
By far the most important attribute is hr. As the result shows, the attributes temp and yr are also important. This is not surprising, since the tree developed in the section "Classification Tree" was already based on these three attributes.
plot(boost.bike ,i="hr")
plot(boost.bike ,i="yr")
The graphical representation of the two most important variables hr and yr shows a clear effect on cnt: the count differs across the individual hours as well as between the two years.
A prediction is performed to check performance on the test data record. As in the previous examples, performance is assessed using MSE.
yhat.boost=predict(boost.bike,newdata=d.bike.test.new, n.trees=1000)
mean((yhat.boost -d.bike.test.new$cnt)^2)
## [1] 65722.11
The MSE is very high at about 65722, roughly thirty times as high as in the bagging example (see Bagging). Adjustments to the shrinkage parameter have no great effect (see line 668, group13_bikesharing.R).
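One likely explanation for the inflated MSE (an assumption on our part, not stated in the original analysis): with distribution = "poisson", predict.gbm() returns predictions on the log scale by default (type = "link"), so they are not directly comparable to the raw counts. A sketch of a prediction on the response scale, reusing the boost.bike model and test set from above:

```r
library(gbm)

# Assumption: predict.gbm() with a Poisson model returns log-scale
# predictions by default; type = "response" returns expected counts instead,
# which can then be compared with cnt on the same scale.
yhat.boost.resp <- predict(boost.bike, newdata = d.bike.test.new,
                           n.trees = 1000, type = "response")
mean((yhat.boost.resp - d.bike.test.new$cnt)^2)
```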
The fifth chapter deals with the classification of data using Support Vector Machines (SVM). Two SVMs are trained and tested. In the first case, the data is classified into two classes using an SVM with two predictors. In the second case, the data is classified into five classes using the final model.
An interesting classification is the division of the cyclists without subscription (casual) into those who ride during cool temperatures and those who ride during very warm temperatures. The two attributes casual and atemp are considered. In a plot, a fairly centered separating hyperplane is supposed to divide the data into two classes. For this purpose an intercept on the x-axis (atemp) of 0.5525 is chosen, which visually separates the two types of cyclist.
xi <- 0.5525
ggplot(data = d.bike, mapping = aes(y = casual, x = atemp)) + geom_point() + geom_vline(xintercept = xi, color = "blue")
The scatterplot shows the blue hyperplane.
The next step is to classify the data so that the SVM model knows which classes to determine. This is a prerequisite in the classification. A simple if-statement can be defined for classification. If atemp is smaller than the vertical intercept defined in the visual example, it is class 0, i.e. those cyclists without a subscription who rent a bicycle even at cooler temperatures. If the observed atemp is greater than or equal to the defined intercept of 0.5525, then we are dealing with class 1, i.e. a cyclist who rents a bicycle even in warmer temperatures. A for-loop is used to perform the classification for each observed value.
for(i in 1:nrow(d.bike)){
if(d.bike$atemp[i] < 0.5525){
d.bike$class[i] <- 0
} else if (d.bike$atemp[i] >= 0.5525){
d.bike$class[i] <- 1
}
}
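The for-loop above can also be written in vectorized form, which is more idiomatic in R (a sketch; the toy vector is only for illustration):

```r
# Vectorized equivalent of the classification loop: ifelse() operates on the
# whole vector at once, so no explicit loop is needed.
atemp <- c(0.30, 0.60, 0.5525, 0.50)       # toy atemp values
class <- ifelse(atemp < 0.5525, 0, 1)
stopifnot(identical(class, c(0, 1, 1, 0)))  # values >= 0.5525 fall in class 1

# applied to the full data set:
# d.bike$class <- ifelse(d.bike$atemp < 0.5525, 0, 1)
```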
c.bike <- data.frame(x=d.bike, y=as.factor(d.bike$class))
ggplot(data = c.bike, mapping = aes(y = x.casual, x = x.atemp, color=y)) + geom_point() + geom_vline(xintercept = xi, size = 1, alpha = 0.5)
The figure shows the two classes separated by color.
Now that the classes and thus the target variable are defined, an SVM model can be fitted. As the summary shows, there are 628 support vectors at a cost of 10: 316 of them are in class 0 and 312 in class 1.
svm.bike.1 <- svm(y~x.casual+x.atemp,
data = c.bike,
kernel = "linear",
cost = 10,
scale = FALSE)
summary(svm.bike.1)
##
## Call:
## svm(formula = y ~ x.casual + x.atemp, data = c.bike, kernel = "linear",
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 628
##
## ( 316 312 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
plot(svm.bike.1, c.bike, x.casual~x.atemp)
The plot shows the classification and the support vectors.
In order to obtain the best model with optimal cost factor, a vector with different cost values is first defined and the tune-function is used.
cost_range <-
c(0.01,
0.1,
1,
5,
10,
100)
tune.out <- tune(
svm,
y ~ x.casual+x.atemp,
data = c.bike,
kernel = "linear",
ranges = list(cost = cost_range)
)
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.0104725354 0.0025566324
## 2 1e-01 0.0002301827 0.0002971646
## 3 1e+00 0.0000000000 0.0000000000
## 4 5e+00 0.0000000000 0.0000000000
## 5 1e+01 0.0000000000 0.0000000000
## 6 1e+02 0.0000000000 0.0000000000
As can be seen, with a cost factor of 1, an error rate of 0 is already possible.
svm.bike.1.best <- tune.out$best.model
summary(svm.bike.1.best)
##
## Call:
## best.tune(method = svm, train.x = y ~ x.casual + x.atemp, data = c.bike,
## ranges = list(cost = cost_range), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 366
##
## ( 184 182 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
plot(svm.bike.1.best, c.bike, x.casual~x.atemp)
The summary of the best model confirms this. The best model has an optimal cost factor of 1 and consequently fewer support vectors, a little more than half as many as in the first model.
table(predict = predict(svm.bike.1.best, c.bike),
truth = c.bike$y)
## truth
## predict 0 1
## 0 11148 0
## 1 0 6231
As the table shows, no errors are made in the prediction and the data is correctly classified. This is to be expected: the class was defined as a deterministic threshold on atemp, which is itself one of the predictors, so the data is linearly separable by construction.
In the second section of this chapter, a classification into five classes based on the final.model.1 will be made. The same classes are used that were already generated in the “Classification Tree” section. Due to the multi-dimensionality (more than 3 predictors), the SVM classification cannot be visualized.
To be able to fit an SVM model, the two data sets must be transformed. The class is removed from the x-variables and assigned as a factor to the y-variable, i.e. the target variable.
d.bike.train.svm <- data.frame(x=subset(d.bike.train.new, select=-c(class)), y=as.factor(d.bike.train.new$class))
d.bike.test.svm <- data.frame(x=subset(d.bike.test.new, select=-c(class)), y=as.factor(d.bike.test.new$class))
Now the model can be fitted. The cost factor is deliberately set low so that classification errors occur first, which are then to be eliminated by tuning. At this point it should also be noted that the model in this example is now fitted with the training data set.
svm.bike.2 <- svm(y ~ .,
data = d.bike.train.svm,
kernel = "linear",
cost = 0.01,
metric = "RMSE",
scale = FALSE
)
summary(svm.bike.2)
##
## Call:
## svm(formula = y ~ ., data = d.bike.train.svm, kernel = "linear",
## cost = 0.01, metric = "RMSE", , scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 154
##
## ( 43 32 59 16 4 )
##
##
## Number of Classes: 5
##
## Levels:
## 0 1 2 3 4
Now the classes are predicted once on the training and once on the test data set.
predict.svm.bike.2.train <- predict(svm.bike.2, d.bike.train.svm)
table(predict.svm.bike.2.train, d.bike.train.svm$y)
##
## predict.svm.bike.2.train 0 1 2 3 4
## 0 8011 0 0 0 0
## 1 0 3208 0 0 0
## 2 0 0 1254 5 0
## 3 0 0 0 413 0
## 4 0 0 0 0 143
predict.svm.bike.2.test <- predict(svm.bike.2, d.bike.test.svm)
table(predict.svm.bike.2.test, d.bike.test.svm$y)
##
## predict.svm.bike.2.test 0 1 2 3 4
## 0 2631 0 0 0 0
## 1 0 1133 0 0 0
## 2 0 0 402 2 0
## 3 0 0 0 142 0
## 4 0 0 0 0 35
Errors are made on both data sets. These must now be eliminated. To do this, a cost range is initialized as in the first example. The model is again optimized using the tune function.
cost_range <-
c(0.01,
0.1,
1,
5,
10,
100,
1000)
tune.out <- tune(
svm,
y~.,
data = d.bike.train.svm,
kernel = "linear",
ranges = list(cost = cost_range),
scale = FALSE
)
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.0001534919
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.0008439703 0.0009872952
## 2 1e-01 0.0004602404 0.0006467446
## 3 1e+00 0.0001534919 0.0003235894
## 4 5e+00 0.0001534919 0.0003235894
## 5 1e+01 0.0001534919 0.0003235894
## 6 1e+02 0.0001534919 0.0003235894
## 7 1e+03 0.0001534919 0.0003235894
Also in this example, a cost value of 1 seems to be sufficient: the cross-validated error rate drops to about 0.00015, i.e. practically 0.
svm.bike.2.best <- tune.out$best.model
summary(svm.bike.2.best)
##
## Call:
## best.tune(method = svm, train.x = y ~ ., data = d.bike.train.svm,
## ranges = list(cost = cost_range), kernel = "linear", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 58
##
## ( 15 13 23 5 2 )
##
##
## Number of Classes: 5
##
## Levels:
## 0 1 2 3 4
The optimal model has the cost value 1. In this classification "only" 58 support vectors are generated. These are distributed very unevenly across the classes; the highest class 4 contains only two support vectors.
Subsequently, the classes are to be predicted once on the training data set and once on the test data set. Based on the ratio of correct and incorrect classifications, the performance on both data sets will be evaluated.
predict.svm.bike.2.best.train <- predict(svm.bike.2.best, d.bike.train.svm)
table(predict.svm.bike.2.best.train, d.bike.train.svm$y)
##
## predict.svm.bike.2.best.train 0 1 2 3 4
## 0 8011 0 0 0 0
## 1 0 3208 0 0 0
## 2 0 0 1254 0 0
## 3 0 0 0 418 0
## 4 0 0 0 0 143
corrects=sum(predict.svm.bike.2.best.train==d.bike.train.svm$y)
errors=sum(predict.svm.bike.2.best.train!=d.bike.train.svm$y)
(performance_train=corrects/(corrects+errors))
## [1] 1
A performance rate of 100% is achieved on the training data set. This can already be seen in the table: all data is assigned to the correct classes.
predict.svm.bike.2.best.test <- predict(svm.bike.2.best, d.bike.test.svm)
table(predict.svm.bike.2.best.test, d.bike.test.svm$y)
##
## predict.svm.bike.2.best.test 0 1 2 3 4
## 0 2631 0 0 0 0
## 1 0 1133 0 0 0
## 2 0 0 401 0 0
## 3 0 0 1 144 0
## 4 0 0 0 0 35
corrects=sum(predict.svm.bike.2.best.test==d.bike.test.svm$y)
errors=sum(predict.svm.bike.2.best.test!=d.bike.test.svm$y)
(performance_test=corrects/(corrects+errors))
## [1] 0.9997699
An almost identical result is obtained on the test data set: only a single class-2 observation is misclassified, for an accuracy of 99.98%. An SVM can thus make good predictions for the two classification problems presented and the selected data set.
In this last chapter a neural network prediction model is trained and tested. Note that the helper function displayDigit below expects 28x28 image data; since the bike-sharing observations are not images, calling it only produces matrix warnings.
displayDigit <- function(X){
m <- matrix(unlist(X),nrow = 28,byrow = T)
m <- t(apply(m, 2, rev))
image(m,col=grey.colors(255))
}
set.seed(400)
(ref=sample(1:dim(d.bike.new)[1], 1, replace=F))
## [1] 9829
displayDigit(d.bike.new[ref,-1])
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]
set.seed(13)
training <- d.bike.train.new[,-1]
test <- d.bike.test.new
test_total <- d.bike.new
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 10 hours
## H2O cluster timezone: Europe/Zurich
## H2O data parsing timezone: UTC
## H2O cluster version: 3.30.0.1
## H2O cluster version age: 2 months and 7 days
## H2O cluster name: H2O_started_from_R_taacece1_pal554
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.27 GB
## H2O cluster total cores: 16
## H2O cluster allowed cores: 16
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.2 (2019-12-12)
train_hf <- as.h2o(training)
set.seed(105)
model_dl <- h2o.deeplearning(y = 4, training_frame = train_hf, nfolds = 10, seed=500, verbose = F)
# prediction on TRAIN
predicted_h20_deep_neural_network_probabilites_train <- h2o.predict(model_dl, train_hf)
#table( as.data.frame(predicted_h20_deep_neural_network_probabilites_train$predict))
predicted_h20_deep_neural_network_class_train <- as.data.frame(predicted_h20_deep_neural_network_probabilites_train$predict)
#table(training$cnt, predicted_h20_deep_neural_network_class_train[,1])
predicted_h20_deep_neural_network_performance_train <- mean(predicted_h20_deep_neural_network_class_train[,1] == training$cnt)*100
print(paste('training accuracy: ',predicted_h20_deep_neural_network_performance_train,"%"))
## [1] "training accuracy: 1.01273592143624 %"
#prediction on TEST
test_hf <- as.h2o(test)
predicted_h20_deep_neural_network_probabilites_test <- h2o.predict(model_dl, test_hf)
#table( as.data.frame(predicted_h20_deep_neural_network_probabilites_test$predict))
predicted_h20_deep_neural_network_class_test <- as.data.frame(predicted_h20_deep_neural_network_probabilites_test$predict)
#table(test$cnt, as.integer(predicted_h20_deep_neural_network_class_test[,1]))
predicted_h20_deep_neural_network_performance_test <- mean(predicted_h20_deep_neural_network_class_test[,1] == test$cnt)*100
print(paste('test accuracy: ',predicted_h20_deep_neural_network_performance_test,"%"))
## [1] "test accuracy: 1.33486766398159 %"
set.seed(400)
(ref_test=sample(1:dim(test)[1], 4, replace=F))
## [1] 1637 3958 3580 1756
for (idx in ref_test){
real_class =test[idx,1]
prediction = predicted_h20_deep_neural_network_class_test[idx,1]
correct = (as.integer(real_class)==as.integer(prediction))
print(paste('Index ',idx,' class: ',(as.integer(real_class)-1)," --> predicted: ",prediction," [correct = ",correct,"]"))
displayDigit(test_total[idx,-1])
i <- readline(prompt="Press [enter] to continue")
}
## [1] "Index 1637 class: 3 --> predicted: 1 [correct = FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]
## Press [enter] to continue
## [1] "Index 3958 class: 3 --> predicted: 1 [correct = FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]
## Press [enter] to continue
## [1] "Index 3580 class: 2 --> predicted: 2 [correct = FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]
## Press [enter] to continue
## [1] "Index 1756 class: 3 --> predicted: 1 [correct = FALSE ]"
## Warning in matrix(unlist(X), nrow = 28, byrow = T): data length [9] is not a
## sub-multiple or multiple of the number of rows [28]
## Press [enter] to continue
The purpose of this chapter is to compare all the models that have been used in this work to analyze and predict bike rentals. The RMSE is used as the comparison value, i.e. as the measure of fit. To guarantee a fair comparison, 10-fold cross-validation is applied.
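The RMSE computed below can also be wrapped in a small helper function (a sketch; the name rmse_vec is our own, chosen so as not to mask modelr::rmse, which is loaded at the top of the script):

```r
# Root mean squared error of a prediction vector (hypothetical helper)
rmse_vec <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# toy check: errors of (0, 0, 2) give sqrt(4/3)
stopifnot(isTRUE(all.equal(rmse_vec(c(1, 2, 3), c(1, 2, 5)), sqrt(4 / 3))))
```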
lm.bike.1.rmse <- sqrt(mean(lm.bike.1$residuals^2))
gam.bike.1.rmse <- sqrt(mean(gam.bike.1$residuals^2))
poi.bike.1.rmse <- sqrt(mean(poi.bike.1$residuals^2))
#regression.tree.rmse <- mean((d.bike.test.new["cnt"]-tree.regression.bike.2.pred)$cnt)
#classification.tree.rmse <- mean((d.bike.test.new["class"]-tree.classification.bike.pred.test)$class)
bagging.rmse <- sqrt(mean((yhat.bag-d.bike.test.new$cnt)^2))
randomforest.rmse <- NA
boosting.rmse <- sqrt(mean((yhat.boost -d.bike.test.new$cnt)^2))
#svm.rmse <- rmse(svm.bike.2.best, d.bike.test.svm$y)